# Monitoring System Health
Monitoring the health of the "For You" feed algorithm is critical for ensuring low-latency responses and high-quality, fresh content retrieval. The system exposes real-time metrics and logging capabilities across its core components: Home Mixer, Thunder, and Phoenix.
This guide walks you through accessing these metrics and interpreting key indicators of system health.
## Accessing the Metrics Endpoint
Both the Home Mixer and Thunder services include a built-in HTTP server to export Prometheus-compatible metrics.
- **Configure the Metrics Port:** When starting the services, use the `--metrics-port` flag.

  ```shell
  # Example: Starting Home Mixer with metrics on port 9090
  ./home-mixer --grpc-port 50051 --metrics-port 9090 --reload-interval-minutes 60
  ```

- **Scrape the Endpoint:** Once the service is running, metrics are available at `http://<host>:<metrics_port>/metrics`.
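The scraped payload is the standard Prometheus text exposition format: one `name value` pair per line, with `#` lines as comments. As a quick sanity check outside of a full Prometheus setup, here is a minimal sketch that pulls a single counter out of a scraped body (the sample payload is illustrative, not actual service output):

```rust
// Extract the value of one metric from a Prometheus text-format payload.
// Ignores "# HELP" / "# TYPE" comment lines and label variants for brevity.
fn metric_value(body: &str, name: &str) -> Option<f64> {
    body.lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                (Some(n), Some(v)) if n == name => v.parse().ok(),
                _ => None,
            }
        })
}

fn main() {
    // A trimmed-down example of what GET /metrics might return.
    let body = "\
# TYPE rejected_requests counter
rejected_requests 3
in_flight_requests 12
";
    assert_eq!(metric_value(body, "rejected_requests"), Some(3.0));
    assert_eq!(metric_value(body, "in_flight_requests"), Some(12.0));
    println!("rejected_requests = {:?}", metric_value(body, "rejected_requests"));
}
```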
## Monitoring Retrieval Freshness (Thunder)
The Thunder service (In-Network retrieval) is responsible for fetching the most recent posts from accounts a user follows. You should monitor these metrics to ensure the "In-Network" feed doesn't go stale.
### Key Freshness Metrics

- `GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS`: Tracks the time difference between the current time and the most recent post found.
- `GET_IN_NETWORK_POSTS_FOUND_TIME_RANGE_SECONDS`: The delta between the oldest and newest post in a retrieved batch.
- `GET_IN_NETWORK_POSTS_FOUND_REPLY_RATIO`: Monitor this to ensure the feed isn't becoming over-saturated with replies versus original posts.
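To make these gauges concrete, here is a hedged sketch of how the three values could be derived from a retrieved batch; timestamps are epoch seconds, and the function names only mirror the metrics above (this is not the actual Thunder implementation):

```rust
/// Seconds between `now` and the newest post in the batch.
fn freshness_seconds(now: u64, post_times: &[u64]) -> Option<u64> {
    post_times.iter().max().map(|newest| now.saturating_sub(*newest))
}

/// Delta between the oldest and newest post in the batch.
fn time_range_seconds(post_times: &[u64]) -> Option<u64> {
    match (post_times.iter().min(), post_times.iter().max()) {
        (Some(min), Some(max)) => Some(max - min),
        _ => None,
    }
}

/// Fraction of the batch that is replies rather than original posts.
fn reply_ratio(replies: usize, total: usize) -> f64 {
    if total == 0 { 0.0 } else { replies as f64 / total as f64 }
}

fn main() {
    let now = 1_700_000_600;
    let times = [1_700_000_000, 1_700_000_300, 1_700_000_500];
    assert_eq!(freshness_seconds(now, &times), Some(100)); // newest post is 100s old
    assert_eq!(time_range_seconds(&times), Some(500));     // oldest vs. newest
    assert_eq!(reply_ratio(2, 8), 0.25);                   // 25% replies
    println!("freshness={:?}s range={:?}s", freshness_seconds(now, &times), time_range_seconds(&times));
}
```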
### Tracking Kafka Consumption
Thunder relies on Kafka to ingest new post events. If ingestion fails, retrieval freshness will drop.
- **Failed Parses:** Monitor `KAFKA_MESSAGES_FAILED_PARSE`. A spike here indicates a schema mismatch or data corruption in the upstream tweet event firehose.
- **Processing Time:** Check `BATCH_PROCESSING_TIME` to ensure the Kafka consumer is keeping up with the volume of X's global post stream.
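A "spike" is easiest to judge as a failure *rate* rather than a raw count, and "keeping up" means a batch is processed faster than the next one arrives. A minimal sketch of both checks, with thresholds that are assumptions rather than values from the actual service:

```rust
/// Flag a Kafka batch whose parse-failure ratio exceeds `threshold`
/// (e.g. 0.01 = 1%). The threshold is illustrative.
fn parse_failure_alert(failed: u64, total: u64, threshold: f64) -> bool {
    total > 0 && (failed as f64 / total as f64) > threshold
}

/// The consumer keeps up when batch processing finishes before the
/// next batch is due.
fn consumer_keeping_up(batch_processing_ms: u64, batch_interval_ms: u64) -> bool {
    batch_processing_ms < batch_interval_ms
}

fn main() {
    assert!(!parse_failure_alert(3, 10_000, 0.01));  // 0.03% failures: healthy
    assert!(parse_failure_alert(500, 10_000, 0.01)); // 5% failures: likely schema mismatch
    assert!(consumer_keeping_up(400, 1_000));        // 400ms work per 1s of input
    assert!(!consumer_keeping_up(1_500, 1_000));     // falling behind: freshness will drop
    println!("ok");
}
```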
## Monitoring Ranking Latency (Phoenix & Home Mixer)
Ranking via the Grok-based transformer is the most computationally expensive part of the pipeline. Monitoring latency here ensures the "For You" feed remains responsive.
### Latency Checkpoints

- `GET_IN_NETWORK_POSTS_DURATION`: Total time taken for the Thunder service to return candidates.
- `GET_IN_NETWORK_POSTS_DURATION_WITHOUT_STRATO`: Latency excluding the time spent fetching the user's "Following" list. If this value is high, the bottleneck is the internal `PostStore` lookup.
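When alerting on these durations, percentiles are more robust than averages, since one slow `PostStore` scan can hide in a mean. A sketch of a nearest-rank p99 over a window of latency samples (the percentile method here is a common choice, not something taken from the codebase):

```rust
/// Nearest-rank percentile over a window of latency samples (milliseconds).
/// Sorts a copy, which is fine for dashboard-sized windows.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: ceil(p/100 * n), clamped to valid indices.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1).min(sorted.len() - 1)])
}

fn main() {
    let durations: Vec<u64> = (1..=100).collect(); // 1..100 ms of fake samples
    assert_eq!(percentile(&durations, 99.0), Some(99));
    assert_eq!(percentile(&durations, 50.0), Some(50));
    assert_eq!(percentile(&[], 99.0), None);
    println!("p99 = {:?} ms", percentile(&durations, 99.0));
}
```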
### Request Throughput and Rejection
The system uses semaphores to prevent cascading failures during traffic spikes.
- **In-Flight Requests:** Monitor `IN_FLIGHT_REQUESTS` to see how many active ranking tasks are being processed.
- **Rejected Requests:** If `REJECTED_REQUESTS` increases, the system has reached its `max_concurrent_requests` limit (configured via CLI) and is shedding load to protect service stability.
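The load-shedding behavior can be pictured as a counting semaphore: each request tries to acquire a permit up front and is rejected immediately when none are left. A sketch using an atomic counter (the real services' internals may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counting-semaphore-style admission control: at most `max_concurrent`
/// requests hold a permit; the rest are rejected (load shedding).
struct Admission {
    in_flight: AtomicUsize,
    rejected: AtomicUsize,
    max_concurrent: usize,
}

impl Admission {
    fn new(max_concurrent: usize) -> Self {
        Self {
            in_flight: AtomicUsize::new(0),
            rejected: AtomicUsize::new(0),
            max_concurrent,
        }
    }

    /// Try to admit a request; returns false (and bumps the rejected
    /// counter) when the concurrency limit is reached.
    fn try_acquire(&self) -> bool {
        let mut current = self.in_flight.load(Ordering::Acquire);
        loop {
            if current >= self.max_concurrent {
                self.rejected.fetch_add(1, Ordering::Relaxed);
                return false;
            }
            match self.in_flight.compare_exchange(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    /// Release a permit when the ranking task finishes.
    fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let gate = Admission::new(2);
    assert!(gate.try_acquire());
    assert!(gate.try_acquire());
    assert!(!gate.try_acquire()); // limit hit: request shed
    gate.release();
    assert!(gate.try_acquire()); // permit freed, admitted again
    assert_eq!(gate.rejected.load(Ordering::Relaxed), 1);
    println!("rejected = {}", gate.rejected.load(Ordering::Relaxed));
}
```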
## Using the Stats Logger
The Thunder service includes a background task that logs the state of the in-memory `PostStore` at regular intervals.
### How to Enable Stats Logging

In Thunder's `main.rs`, the stats logger is started automatically when the service is in serving mode:
```rust
if args.is_serving {
    // Start stats logger to output memory and storage health to stdout/logs
    post_store.start_stats_logger();
    // Start auto-trim task to remove posts older than the retention period
    post_store.start_auto_trim(2);
}
```
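For intuition, the trim pass amounts to evicting every post older than the retention window. A simplified sketch of that eviction logic (the actual `PostStore` internals are more involved than a flat vector):

```rust
/// Simplified eviction pass: keep only posts created within the retention
/// window. Timestamps are epoch seconds; tuples stand in for real posts.
fn trim(posts: &mut Vec<(u64 /* post_id */, u64 /* created_at */)>, now: u64, retention_seconds: u64) {
    let cutoff = now.saturating_sub(retention_seconds);
    posts.retain(|&(_, created_at)| created_at >= cutoff);
}

fn main() {
    let mut posts = vec![(1, 100), (2, 500), (3, 900)];
    trim(&mut posts, 1_000, 300); // cutoff = 700: only post 3 survives
    assert_eq!(posts, vec![(3, 900)]);
    println!("{} post(s) retained", posts.len());
}
```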
### What to Look for in Logs
- **Retention Health:** The logger reports the number of posts currently held in memory. If this number plateaus while latency rises, consider reducing `post_retention_seconds`.
- **Auto-Trim Activity:** Ensure the auto-trim task runs every 2 minutes. If it fails, the service may encounter Out-of-Memory (OOM) errors because old posts are never evicted.
## Summary of Critical Alerts
Set up alerts on your monitoring dashboard (e.g., Grafana) for the following thresholds:
| Metric | Condition | Recommended Action |
| :--- | :--- | :--- |
| `REJECTED_REQUESTS` | > 0 | Increase `max_concurrent_requests` or scale out instances. |
| `KAFKA_MESSAGES_FAILED_PARSE` | Spike in count | Check for upstream schema changes in `tweet_events`. |
| `GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS` | > 300s | Check Kafka consumer lag; the feed is delayed by 5+ minutes. |
| `GET_IN_NETWORK_POSTS_DURATION` | > 200ms | Optimize the `PostStore` query or reduce the `max_posts_to_return` config. |