# Monitoring System Health
Monitoring the health of the "For You" feed algorithm is critical for ensuring low-latency responses and high-quality, fresh content retrieval. The system exposes real-time metrics and logging capabilities across its core components: Home Mixer, Thunder, and Phoenix.
This guide walks you through accessing these metrics and interpreting key indicators of system health.
## Accessing the Metrics Endpoint
Both the Home Mixer and Thunder services include a built-in HTTP server to export Prometheus-compatible metrics.
- **Configure the Metrics Port:** When starting the services, use the `--metrics-port` flag.

  ```shell
  # Example: Starting Home Mixer with metrics on port 9090
  ./home-mixer --grpc-port 50051 --metrics-port 9090 --reload-interval-minutes 60
  ```

- **Scrape the Endpoint:** Once the service is running, metrics are available at `http://<host>:<metrics_port>/metrics`.
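The scraped payload is the standard Prometheus text exposition format: one `name value` pair per line, with `#` lines as comments. As a quick sanity check outside of a full Prometheus setup, here is a minimal sketch that pulls a single counter out of a scraped body (the sample payload is illustrative, not actual service output):

```rust
// Extract the value of one metric from a Prometheus text-format payload.
// Ignores "# HELP" / "# TYPE" comment lines and label variants for brevity.
fn metric_value(body: &str, name: &str) -> Option<f64> {
    body.lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                (Some(n), Some(v)) if n == name => v.parse().ok(),
                _ => None,
            }
        })
}

fn main() {
    // A trimmed-down example of what GET /metrics might return.
    let body = "\
# TYPE rejected_requests counter
rejected_requests 3
in_flight_requests 12
";
    assert_eq!(metric_value(body, "rejected_requests"), Some(3.0));
    assert_eq!(metric_value(body, "in_flight_requests"), Some(12.0));
    println!("rejected_requests = {:?}", metric_value(body, "rejected_requests"));
}
```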
## Monitoring Retrieval Freshness (Thunder)
The Thunder service (In-Network retrieval) is responsible for fetching the most recent posts from accounts a user follows. You should monitor these metrics to ensure the "In-Network" feed doesn't go stale.
### Key Freshness Metrics

- `GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS`: Tracks the time difference between the current time and the most recent post found.
- `GET_IN_NETWORK_POSTS_FOUND_TIME_RANGE_SECONDS`: The delta between the oldest and newest post in a retrieved batch.
- `GET_IN_NETWORK_POSTS_FOUND_REPLY_RATIO`: Monitor this to ensure the feed isn't becoming over-saturated with replies versus original posts.
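To make these gauges concrete, here is a hedged sketch of how the three values could be derived from a retrieved batch; timestamps are epoch seconds, and the function names only mirror the metrics above (this is not the actual Thunder implementation):

```rust
/// Seconds between `now` and the newest post in the batch.
fn freshness_seconds(now: u64, post_times: &[u64]) -> Option<u64> {
    post_times.iter().max().map(|newest| now.saturating_sub(*newest))
}

/// Delta between the oldest and newest post in the batch.
fn time_range_seconds(post_times: &[u64]) -> Option<u64> {
    match (post_times.iter().min(), post_times.iter().max()) {
        (Some(min), Some(max)) => Some(max - min),
        _ => None,
    }
}

/// Fraction of the batch that is replies rather than original posts.
fn reply_ratio(replies: usize, total: usize) -> f64 {
    if total == 0 { 0.0 } else { replies as f64 / total as f64 }
}

fn main() {
    let now = 1_700_000_600;
    let times = [1_700_000_000, 1_700_000_300, 1_700_000_500];
    assert_eq!(freshness_seconds(now, &times), Some(100)); // newest post is 100s old
    assert_eq!(time_range_seconds(&times), Some(500));     // oldest vs. newest
    assert_eq!(reply_ratio(2, 8), 0.25);                   // 25% replies
    println!("freshness={:?}s range={:?}s", freshness_seconds(now, &times), time_range_seconds(&times));
}
```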
### Tracking Kafka Consumption
Thunder relies on Kafka to ingest new post events. If ingestion fails, retrieval freshness will drop.
- **Failed Parses:** Monitor `KAFKA_MESSAGES_FAILED_PARSE`. A spike here indicates a schema mismatch or data corruption in the upstream tweet event firehose.
- **Processing Time:** Check `BATCH_PROCESSING_TIME` to ensure the Kafka consumer is keeping up with the volume of X's global post stream.
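A "spike" is easiest to judge as a failure *rate* rather than a raw count, and "keeping up" means a batch is processed faster than the next one arrives. A minimal sketch of both checks, with thresholds that are assumptions rather than values from the actual service:

```rust
/// Flag a Kafka batch whose parse-failure ratio exceeds `threshold`
/// (e.g. 0.01 = 1%). The threshold is illustrative.
fn parse_failure_alert(failed: u64, total: u64, threshold: f64) -> bool {
    total > 0 && (failed as f64 / total as f64) > threshold
}

/// The consumer keeps up when batch processing finishes before the
/// next batch is due.
fn consumer_keeping_up(batch_processing_ms: u64, batch_interval_ms: u64) -> bool {
    batch_processing_ms < batch_interval_ms
}

fn main() {
    assert!(!parse_failure_alert(3, 10_000, 0.01));  // 0.03% failures: healthy
    assert!(parse_failure_alert(500, 10_000, 0.01)); // 5% failures: likely schema mismatch
    assert!(consumer_keeping_up(400, 1_000));        // 400ms work per 1s of input
    assert!(!consumer_keeping_up(1_500, 1_000));     // falling behind: freshness will drop
    println!("ok");
}
```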
## Monitoring Ranking Latency (Phoenix & Home Mixer)
Ranking via the Grok-based transformer is the most computationally expensive part of the pipeline. Monitoring latency here ensures the "For You" feed remains responsive.
### Latency Checkpoints

- `GET_IN_NETWORK_POSTS_DURATION`: Total time taken for the Thunder service to return candidates.
- `GET_IN_NETWORK_POSTS_DURATION_WITHOUT_STRATO`: Latency excluding the time spent fetching the user's "Following" list. If this value is high, the bottleneck is the internal `PostStore` lookup.
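When alerting on these durations, percentiles are more robust than averages, since one slow `PostStore` scan can hide in a mean. A sketch of a nearest-rank p99 over a window of latency samples (the percentile method here is a common choice, not something taken from the codebase):

```rust
/// Nearest-rank percentile over a window of latency samples (milliseconds).
/// Sorts a copy, which is fine for dashboard-sized windows.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: ceil(p/100 * n), clamped to valid indices.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1).min(sorted.len() - 1)])
}

fn main() {
    let durations: Vec<u64> = (1..=100).collect(); // 1..100 ms of fake samples
    assert_eq!(percentile(&durations, 99.0), Some(99));
    assert_eq!(percentile(&durations, 50.0), Some(50));
    assert_eq!(percentile(&[], 99.0), None);
    println!("p99 = {:?} ms", percentile(&durations, 99.0));
}
```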
### Request Throughput and Rejection
The system uses semaphores to prevent cascading failures during traffic spikes.
- **In-Flight Requests:** Monitor `IN_FLIGHT_REQUESTS` to see how many active ranking tasks are being processed.
- **Rejected Requests:** If `REJECTED_REQUESTS` increases, the system has reached its `max_concurrent_requests` limit (configured via CLI) and is shedding load to protect service stability.
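The load-shedding behavior can be pictured as a counting semaphore: each request tries to acquire a permit up front and is rejected immediately when none are left. A sketch using an atomic counter (the real services' internals may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counting-semaphore-style admission control: at most `max_concurrent`
/// requests hold a permit; the rest are rejected (load shedding).
struct Admission {
    in_flight: AtomicUsize,
    rejected: AtomicUsize,
    max_concurrent: usize,
}

impl Admission {
    fn new(max_concurrent: usize) -> Self {
        Self {
            in_flight: AtomicUsize::new(0),
            rejected: AtomicUsize::new(0),
            max_concurrent,
        }
    }

    /// Try to admit a request; returns false (and bumps the rejected
    /// counter) when the concurrency limit is reached.
    fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.max_concurrent {
                self.rejected.fetch_add(1, Ordering::Relaxed);
                return false;
            }
            match self.in_flight.compare_exchange(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    /// Release a permit when the ranking task finishes.
    fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let gate = Admission::new(2);
    assert!(gate.try_acquire());
    assert!(gate.try_acquire());
    assert!(!gate.try_acquire()); // limit hit: request shed
    gate.release();
    assert!(gate.try_acquire()); // permit freed, admitted again
    assert_eq!(gate.rejected.load(Ordering::Relaxed), 1);
    println!("rejected = {}", gate.rejected.load(Ordering::Relaxed));
}
```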
## Using the Stats Logger
The Thunder service includes a background task that logs the state of the in-memory `PostStore` at regular intervals.
### How to Enable Stats Logging

In Thunder's `main.rs`, the stats logger is started automatically when the service is in serving mode:
```rust
if args.is_serving {
    // Start stats logger to output memory and storage health to stdout/logs
    post_store.start_stats_logger();
    // Start auto-trim task to remove posts older than the retention period
    post_store.start_auto_trim(2);
}
```
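For intuition, the trim pass amounts to evicting every post older than the retention window. A simplified sketch of that eviction logic (the actual `PostStore` internals are more involved than a flat vector):

```rust
/// Simplified eviction pass: keep only posts created within the retention
/// window. Timestamps are epoch seconds; tuples stand in for real posts.
fn trim(posts: &mut Vec<(u64 /* post_id */, u64 /* created_at */)>, now: u64, retention_seconds: u64) {
    let cutoff = now.saturating_sub(retention_seconds);
    posts.retain(|&(_, created_at)| created_at >= cutoff);
}

fn main() {
    let mut posts = vec![(1, 100), (2, 500), (3, 900)];
    trim(&mut posts, 1_000, 300); // cutoff = 700: only post 3 survives
    assert_eq!(posts, vec![(3, 900)]);
    println!("{} post(s) retained", posts.len());
}
```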
### What to Look for in Logs
- **Retention Health:** The logger reports the number of posts currently held in memory. If this number plateaus while latency rises, consider reducing `post_retention_seconds`.
- **Auto-Trim Activity:** Ensure the auto-trim task runs every 2 minutes. If it fails, the service may encounter Out-of-Memory (OOM) errors because old posts are never evicted.
## Summary of Critical Alerts
Set up alerts on your monitoring dashboard (e.g., Grafana) for the following thresholds:
| Metric | Condition | Recommended Action |
| :--- | :--- | :--- |
| `REJECTED_REQUESTS` | > 0 | Increase `max_concurrent_requests` or scale out instances. |
| `KAFKA_MESSAGES_FAILED_PARSE` | Spike in count | Check for upstream schema changes in `tweet_events`. |
| `GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS` | > 300s | Check Kafka consumer lag; the feed is delayed by 5+ minutes. |
| `GET_IN_NETWORK_POSTS_DURATION` | > 200ms | Optimize the `PostStore` query or reduce the `max_posts_to_return` config. |