The Challenge
Our client needed a real-time analytics platform to track user behavior across their SaaS application. The requirements were demanding:
- Process 5+ million events per day
- Real-time dashboard updates (<2 second latency)
- Zero data loss
- Query historical data efficiently
- Cost-effective infrastructure
Traditional approaches (writing every event straight to the database, or buying a heavyweight managed analytics platform) would not scale cost-effectively.
Technical Requirements
Scale:
- 5 million events/day initially
- Expected growth to 50 million/day within a year
- 10,000+ concurrent users
- Real-time aggregations
Reliability:
- 99.9% uptime
- Zero data loss
- Graceful degradation
Performance:
- <2 second dashboard updates
- <100ms API response times
- Efficient historical queries
The Architecture
Event Collection Layer
// EventController.php
public function track(Request $request)
{
    $event = [
        'event_type' => $request->input('event'),
        'user_id' => $request->input('user_id'),
        'properties' => $request->input('properties', []),
        'timestamp' => now()->timestamp,
        'session_id' => $request->input('session_id'),
    ];

    // Write to Redis for immediate processing
    Redis::lpush('events:queue', json_encode($event));

    // Also publish for real-time listeners
    Redis::publish('events:stream', json_encode($event));

    return response()->json(['success' => true]);
}
Why This Approach:
- Redis easily absorbs the write path (50,000+ writes/second on a single node)
- Non-blocking for the user: the request returns as soon as the event is queued
- Dual-write: a list for durable processing, pub/sub for real-time listeners (a wiring sketch follows below)
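For context, here is a minimal sketch of how the tracking endpoint could be wired up. The /api/track URI and the throttle values are illustrative assumptions, not details from the production setup.

// routes/api.php (illustrative route registration; URI and limits assumed)
use App\Http\Controllers\EventController;
use Illuminate\Support\Facades\Route;

// Rate limiting keeps a single misbehaving client from flooding the queue
Route::post('/track', [EventController::class, 'track'])
    ->middleware('throttle:1000,1');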
Processing Layer: Queue Workers
// ProcessAnalyticsEvent.php
class ProcessAnalyticsEvent implements ShouldQueue
{
    use Queueable, SerializesModels;

    public function handle()
    {
        while ($event = Redis::rpop('events:queue')) {
            $data = json_decode($event, true);

            // Batch events for efficiency
            $this->batchEvent($data);

            // Update real-time metrics
            $this->updateMetrics($data);

            // Write to long-term storage
            $this->persistEvent($data);
        }
    }

    private function batchEvent($data)
    {
        // Group events by minute for batch insert
        $minute = floor($data['timestamp'] / 60) * 60;
        $key = "batch:{$minute}";

        Redis::lpush($key, json_encode($data));
        Redis::expire($key, 120); // 2 minute window
    }

    private function updateMetrics($data)
    {
        // Increment counters in Redis
        $hour = date('Y-m-d H:00:00', $data['timestamp']);

        Redis::hincrby("metrics:{$hour}", $data['event_type'], 1);
        Redis::hincrby("metrics:users:{$hour}", $data['user_id'], 1);
        Redis::expire("metrics:{$hour}", 86400 * 7); // 7 days
    }

    private function persistEvent($data)
    {
        // Bulk insert every minute via scheduled job
        // This reduces database load significantly
    }
}
Scheduled Aggregation
// AggregateMetrics.php
class AggregateMetrics extends Command
{
    public function handle()
    {
        // Run every minute
        $this->aggregateMinuteMetrics();

        // Run hourly
        if (date('i') === '00') {
            $this->aggregateHourlyMetrics();
        }
    }

    private function aggregateMinuteMetrics()
    {
        // Aggregate the previous, fully-closed minute so we never
        // flush a batch that is still receiving events
        $minute = floor(time() / 60) * 60 - 60;
        $key = "batch:{$minute}";
        $events = Redis::lrange($key, 0, -1);

        if (empty($events)) return;

        // Batch insert to PostgreSQL, mapping onto the table's columns
        DB::table('analytics_events')->insert(
            array_map(function ($e) {
                $data = json_decode($e, true);
                return [
                    'event_type' => $data['event_type'],
                    'user_id'    => $data['user_id'],
                    'session_id' => $data['session_id'],
                    'properties' => json_encode($data['properties']),
                    'created_at' => date('Y-m-d H:i:s', $data['timestamp']),
                ];
            }, $events)
        );

        // Clean up Redis
        Redis::del($key);
    }
}
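The aggregation command is driven by Laravel's scheduler. A minimal sketch of the registration follows; the analytics:aggregate signature and the idea of re-dispatching the processing job each minute are illustrative assumptions.

// app/Console/Kernel.php (sketch; command signature is assumed)
protected function schedule(Schedule $schedule)
{
    // Flush the previous minute's batch into PostgreSQL
    $schedule->command('analytics:aggregate')->everyMinute();

    // Re-dispatch the drain job so the queue never sits idle for long
    $schedule->job(new ProcessAnalyticsEvent())->everyMinute();
}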
Real-Time Dashboard: WebSockets
// AnalyticsChannel.php
class AnalyticsChannel
{
    public function subscribe()
    {
        // Blocks for the lifetime of the process; run it as a long-lived worker
        Redis::subscribe(['events:stream'], function ($message) {
            $event = json_decode($message, true);

            // Broadcast to WebSocket clients
            broadcast(new AnalyticsEventReceived($event));
        });
    }
}
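Because Redis::subscribe blocks indefinitely, the subscriber runs as a long-lived artisan command under a process supervisor. A minimal sketch, with an assumed command name of analytics:listen:

// ListenAnalyticsEvents.php (sketch; command name is assumed)
class ListenAnalyticsEvents extends Command
{
    protected $signature = 'analytics:listen';
    protected $description = 'Relay Redis pub/sub events to WebSocket clients';

    public function handle()
    {
        // Blocks for the lifetime of the process; Supervisor restarts it if it dies
        (new AnalyticsChannel())->subscribe();
    }
}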
// Frontend: Real-time dashboard
Echo.channel('analytics')
    .listen('AnalyticsEventReceived', (event) => {
        // Update dashboard in real-time
        updateMetric(event.event_type, 1);
        updateUserCount();
    });
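The AnalyticsEventReceived event referenced on both sides is a standard Laravel broadcast event. A minimal sketch, assuming the payload only needs the event type and user id:

// AnalyticsEventReceived.php (sketch; payload fields are assumed)
class AnalyticsEventReceived implements ShouldBroadcast
{
    use Dispatchable, InteractsWithSockets, SerializesModels;

    public $event_type;
    public $user_id;

    public function __construct(array $event)
    {
        $this->event_type = $event['event_type'];
        $this->user_id = $event['user_id'];
    }

    public function broadcastOn()
    {
        // Matches Echo.channel('analytics') on the frontend
        return new Channel('analytics');
    }
}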
Database Schema Design
Hot Data (Recent): PostgreSQL
-- Partitioned by day for efficient queries and maintenance
CREATE TABLE analytics_events (
    id BIGSERIAL,
    event_type VARCHAR(100) NOT NULL,
    user_id BIGINT NOT NULL,
    session_id VARCHAR(100),
    properties JSONB,
    created_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Example daily partition (new partitions are created ahead of time by a scheduled job)
CREATE TABLE analytics_events_2024_01_15 PARTITION OF analytics_events
    FOR VALUES FROM ('2024-01-15') TO ('2024-01-16');

-- Indexes for common queries
CREATE INDEX idx_events_type_time ON analytics_events(event_type, created_at);
CREATE INDEX idx_events_user_time ON analytics_events(user_id, created_at);
CREATE INDEX idx_events_properties ON analytics_events USING GIN(properties);
Cold Data (Historical): ClickHouse
For historical analysis beyond 30 days, data is moved out of PostgreSQL and into ClickHouse (a sketch of the archival job follows the table below):
-- ClickHouse table
CREATE TABLE analytics_events (
    event_type String,
    user_id UInt64,
    properties String,
    timestamp DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, user_id, timestamp);
Why ClickHouse:
- 100x faster for analytical queries
- 10x better compression
- Perfect for time-series data
- Handles billions of rows easily
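The move from PostgreSQL to ClickHouse happens in a scheduled archival job. A rough sketch is below; the 30-day cutoff matches the text above, while the command name, chunk size, and the ClickHouse::insert call are assumptions about the client library rather than its documented API.

// ArchiveToClickHouse.php (sketch; the ClickHouse insert call is assumed)
class ArchiveToClickHouse extends Command
{
    protected $signature = 'analytics:archive';

    public function handle()
    {
        $cutoff = now()->subDays(30);

        // Stream old rows out of PostgreSQL in chunks to keep memory usage flat
        DB::table('analytics_events')
            ->where('created_at', '<', $cutoff)
            ->chunkById(10000, function ($rows) {
                ClickHouse::insert('analytics_events', $rows->map(fn ($r) => [
                    'event_type' => $r->event_type,
                    'user_id'    => $r->user_id,
                    'properties' => $r->properties,
                    'timestamp'  => $r->created_at,
                ])->all());
            });

        // Old PostgreSQL partitions are then dropped to reclaim space
    }
}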
Query Optimization
Real-Time Metrics (Last 24 Hours)
public function getRealtimeMetrics()
{
    // Query Redis for the last 24 hours
    $hours = 24;
    $metrics = [];

    for ($i = 0; $i < $hours; $i++) {
        $hour = now()->subHours($i)->format('Y-m-d H:00:00');
        $key = "metrics:{$hour}";
        $hourMetrics = Redis::hgetall($key);

        $metrics[] = [
            'hour' => $hour,
            'events' => $hourMetrics,
        ];
    }

    return $metrics;
}
Performance:
- Returns in <50ms
- No database queries
- Scales horizontally with Redis Cluster
Historical Queries (30+ Days)
public function getHistoricalTrends(Request $request)
{
    $startDate = $request->input('start_date');
    $endDate = $request->input('end_date');

    // Query ClickHouse for historical data
    $query = ClickHouse::select("
        SELECT
            toStartOfHour(timestamp) AS hour,
            event_type,
            count() AS count
        FROM analytics_events
        WHERE timestamp BETWEEN ? AND ?
        GROUP BY hour, event_type
        ORDER BY hour
    ", [$startDate, $endDate]);

    return $query;
}
Performance:
- Queries billions of rows in seconds
- Aggregations stay fast thanks to ClickHouse's columnar storage
- Cached results for common queries (see the caching sketch below)
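For that caching, wrapping the ClickHouse query in Laravel's cache is enough for most dashboards. A minimal sketch; the cache key format and the 10-minute TTL are assumptions:

// Cached wrapper (sketch; cache key format and TTL are assumed)
public function getHistoricalTrendsCached(string $startDate, string $endDate)
{
    $cacheKey = "trends:{$startDate}:{$endDate}";

    return Cache::remember($cacheKey, now()->addMinutes(10), function () use ($startDate, $endDate) {
        return ClickHouse::select("
            SELECT toStartOfHour(timestamp) AS hour, event_type, count() AS count
            FROM analytics_events
            WHERE timestamp BETWEEN ? AND ?
            GROUP BY hour, event_type
            ORDER BY hour
        ", [$startDate, $endDate]);
    });
}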
Handling Scale
Load Balancing
# nginx.conf
upstream app {
    least_conn;
    server app1:8000;
    server app2:8000;
    server app3:8000;
}

upstream workers {
    server worker1:8000;
    server worker2:8000;
    server worker3:8000;
}
Auto-Scaling Workers
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-worker
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Redis Cluster
For handling growth:
- 6-node Redis Cluster
- 16,384 hash slots distributed across the masters
- Automatic failover
- Read replicas for queries (connection sketch below)
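On the application side, pointing Laravel at the cluster is mostly configuration. A minimal sketch of the relevant block in config/database.php; the host names are placeholders:

// config/database.php (sketch; host names are placeholders)
'redis' => [
    'client' => 'phpredis',

    'options' => [
        'cluster' => 'redis', // use native Redis Cluster mode
    ],

    'clusters' => [
        'default' => [
            ['host' => 'redis-node-1', 'port' => 6379, 'database' => 0],
            ['host' => 'redis-node-2', 'port' => 6379, 'database' => 0],
            ['host' => 'redis-node-3', 'port' => 6379, 'database' => 0],
        ],
    ],
],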
Monitoring & Alerting
Key Metrics Tracked
// Application metrics
Prometheus::increment('events_received_total');
Prometheus::observe('event_processing_duration_seconds', $duration);
// Business metrics
Prometheus::gauge('active_users_count', $activeUsers);
Prometheus::histogram('query_latency_seconds', $queryTime);
Alerts Configuration
- alert: HighEventProcessingLatency
  expr: event_processing_duration_seconds > 5
  for: 5m
  annotations:
    description: Event processing is taking too long

- alert: RedisQueueBacklog
  expr: redis_queue_length > 100000
  for: 2m
  annotations:
    description: Redis queue is building up, scale workers
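The redis_queue_length series used by the second alert has to be exported by the application itself. With the same Prometheus wrapper shown earlier, a scheduled export might look like this (the one-minute cadence is an assumption):

// app/Console/Kernel.php, inside schedule(): export queue depth so the
// RedisQueueBacklog alert has a series to evaluate
$schedule->call(function () {
    Prometheus::gauge('redis_queue_length', Redis::llen('events:queue'));
})->everyMinute();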
Results
Performance Metrics
- Event Ingestion: 15,000 events/second (3x requirement)
- Dashboard Latency: 0.8 seconds average (60% better)
- API Response Time: 45ms average (55% better)
- Zero Data Loss: ✓ Achieved through redundancy
Cost Efficiency
- Infrastructure Cost: $800/month
- Cost per Million Events: $5.33
- 70% cheaper than managed analytics platforms
- Scales linearly with usage
Reliability
- Uptime: 99.95% over 12 months
- Data Loss: 0 events lost
- Recovery Time: <5 minutes during incidents
Lessons Learned
What Worked Well
- Redis as a Buffer
  - Absorbed traffic spikes
  - Simplified scaling
  - Provided real-time capabilities
- Batch Processing
  - Reduced database load by 95%
  - Improved throughput significantly
  - Easier to optimize
- Tiered Storage
  - Hot data in PostgreSQL
  - Cold data in ClickHouse
  - Best of both worlds
What We'd Do Differently
- Earlier ClickHouse Adoption
  - Should have started with it
  - Migration was complex
- Better Monitoring from Day One
  - Would have caught issues earlier
  - Helps with capacity planning
- More Comprehensive Testing
  - Load testing earlier in development
  - Better chaos engineering practices
Scalability Roadmap
- Current: 5M events/day
- 6 months: 20M events/day
- 12 months: 50M events/day
Scaling Strategy:
- Horizontal scaling of workers
- Redis Cluster expansion
- ClickHouse cluster for cold storage
- Geographic distribution of processing
Conclusion
Building a real-time analytics platform at scale requires careful architectural decisions, the right tools for each layer, and a focus on both performance and reliability.
The key is to:
- Use Redis for real-time operations
- Batch operations to reduce database load
- Choose the right storage for hot vs cold data
- Monitor everything and automate scaling
- Plan for failure at every layer
The architecture we built handles millions of events daily, provides real-time insights, and costs a fraction of managed alternatives—all while maintaining 99.9%+ uptime.