Monitoring Pipeline Health
Ensuring the stability and performance of the For You feed requires active monitoring of the retrieval and ranking pipelines. The x-algorithm system exposes a comprehensive suite of metrics via Prometheus-compatible endpoints.
This guide walks you through accessing these metrics and interpreting key indicators to maintain a healthy recommendation ecosystem.
1. Accessing the Metrics Endpoint
Both the Home Mixer and Thunder services expose a dedicated metrics port, which is set with a CLI flag at startup.
- Locate the metrics port: Look for the `--metrics-port` flag in your service configuration or startup logs.

  ```shell
  # Example: Starting Home Mixer with metrics on port 9090
  ./home-mixer --grpc-port 50051 --metrics-port 9090
  ```
- Query the endpoint: Use `curl` or point your Prometheus instance to the `/metrics` path.

  ```shell
  curl http://localhost:9090/metrics
  ```
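For ad hoc checks without a Prometheus server, the endpoint's response can be read directly. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The helper names `fetch_metrics` and `parse_metrics` are hypothetical and not part of x-algorithm, and the exact metric names and labels the services expose may differ from the constants used in this guide.

```python
from typing import Dict
from urllib.request import urlopen


def fetch_metrics(url: str, timeout_s: float = 5.0) -> str:
    """Download the raw text served at a /metrics endpoint."""
    with urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse sample lines of the Prometheus text format into a dict.

    Keys are the sample name including any label set, exactly as written
    in the exposition text. Values are floats.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Label values may contain spaces, so split at the closing brace.
            head, _, tail = line.rpartition("}")
            name = head + "}"
            fields = tail.split()
        else:
            name, *fields = line.split()
        if not fields:
            continue
        # The first field is the value; an optional timestamp may follow.
        samples[name] = float(fields[0])
    return samples
```

With a service running locally on the port from the example above, `parse_metrics(fetch_metrics("http://localhost:9090/metrics"))` would return every exposed sample keyed by name.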
2. Monitoring System Load and Rejections
The system uses semaphores to limit concurrent requests and prevent cascading failures. Monitoring these metrics is the first step in identifying bottlenecks.
- `REJECTED_REQUESTS`: Increments when the system is at maximum capacity and cannot accept new requests. If this is rising, consider increasing `--max-concurrent-requests` or scaling your instances.
- `IN_FLIGHT_REQUESTS`: Shows the current number of active requests being processed. Use this to track real-time utilization against your hardware limits.
- `GET_IN_NETWORK_POSTS_DURATION`: Measures the total latency of the In-Network (Thunder) retrieval stage.
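A raw counter value says little on its own; what matters is how fast it grows between scrapes. The sketch below shows one way to turn two readings of `REJECTED_REQUESTS` into a per-second rate, and to express `IN_FLIGHT_REQUESTS` as a fraction of the configured limit. The function names are hypothetical, and the restart handling is an assumption about how the counter behaves when a process is replaced.

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second increase of a monotonic counter between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev
    if delta < 0:
        # Assumption: a counter that moves backwards means the process
        # restarted, so the new value is the increase since the restart.
        delta = curr
    return delta / interval_s


def utilization(in_flight: float, max_concurrent: int) -> float:
    """Fraction of the concurrency limit currently in use."""
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")
    return in_flight / max_concurrent
```

For example, 30 new rejections over a 15-second scrape interval is a rate of 2 per second, and 48 in-flight requests against a limit of 64 is 75% utilization.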
3. Tracking Candidate Freshness and Diversity
Because the "For You" feed prioritizes real-time relevance, you must monitor the "freshness" of the candidates being retrieved from the PostStore.
Freshness Metrics
Use the following metrics to ensure the pipeline isn't serving stale content:
- `GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS`: Tracks the time elapsed since the most recent post in a retrieval batch. High values indicate a lag in the Kafka ingestion pipeline.
- `GET_IN_NETWORK_POSTS_FOUND_TIME_RANGE_SECONDS`: The delta between the oldest and newest post in a batch. This helps you understand the temporal depth of the candidate set.
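To make the two definitions concrete, here is a small sketch that derives both values from the creation timestamps of one batch. It is an illustration of the definitions given above, not the service's own code; the function name and the use of Unix seconds are assumptions.

```python
from typing import Sequence, Tuple


def freshness_and_range(created_at: Sequence[float], now: float) -> Tuple[float, float]:
    """Return (freshness_seconds, time_range_seconds) for one batch.

    created_at holds post creation times in Unix seconds (an assumption).
    Freshness is the age of the newest post; time range is the span
    between the oldest and newest post.
    """
    if not created_at:
        raise ValueError("batch is empty")
    newest = max(created_at)
    oldest = min(created_at)
    return now - newest, newest - oldest
```

A batch with posts created at 1000, 1600, and 1900 seconds, evaluated at 2000 seconds, has a freshness of 100 seconds and a time range of 900 seconds.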
Diversity and Quality Metrics
- `GET_IN_NETWORK_POSTS_FOUND_UNIQUE_AUTHORS`: Tracks how many distinct authors are represented in the candidate set. A drop here might indicate an issue with the "Follow" graph hydration.
- `GET_IN_NETWORK_POSTS_FOUND_REPLY_RATIO`: Monitors the balance between original posts and replies. This is critical for ensuring the feed doesn't become over-saturated with conversational threads.
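The sketch below shows how these two values could be computed for a single candidate batch. The `Post` type and its fields are hypothetical stand-ins, and the reply ratio is assumed here to mean replies divided by all posts in the batch; the service may define it differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Post:
    # Hypothetical minimal post record for illustration only.
    author_id: int
    in_reply_to: Optional[int] = None  # ID of the parent post, if a reply


def unique_authors(posts: Sequence[Post]) -> int:
    """Number of distinct authors in the batch."""
    return len({p.author_id for p in posts})


def reply_ratio(posts: Sequence[Post]) -> float:
    """Share of posts in the batch that are replies (0.0 for an empty batch)."""
    if not posts:
        return 0.0
    replies = sum(1 for p in posts if p.in_reply_to is not None)
    return replies / len(posts)
```

Tracking both values per batch against a rolling baseline makes a sudden drop in author count or a spike in replies easier to spot than a fixed threshold would.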
4. Observing Pipeline Storage Health
The Thunder service manages an in-memory PostStore with an automated cleanup task. Monitoring the "trimming" process ensures you aren't leaking memory while retaining enough history for the Phoenix model.
- Check Retention Logs: The system logs the status of the auto-trim task every 2 minutes (default), for example:

  ```text
  Started PostStore auto-trim task (interval: 2 minutes, retention: 2.0 days)
  ```
- Monitor Batch Processing:
  - `KAFKA_MESSAGES_FAILED_PARSE`: If this counter increases, it indicates a schema mismatch between your producers and the algorithm's deserializers.
  - `BATCH_PROCESSING_TIME`: Tracks how long it takes to ingest and index posts from Kafka into the in-memory store.
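As a mental model for what the auto-trim task does, the following sketch removes entries older than the retention window from a time-ordered queue. It is not Thunder's implementation: the data layout and the function name are assumptions, and only the 2.0-day retention value comes from the log line above.

```python
from collections import deque
from typing import Deque, Tuple

# Retention window taken from the example log line (2.0 days), in seconds.
RETENTION_SECONDS = 2.0 * 24 * 60 * 60


def trim_expired(
    posts: Deque[Tuple[float, int]],
    now: float,
    retention_s: float = RETENTION_SECONDS,
) -> int:
    """Drop expired posts in place and return how many were removed.

    posts holds (created_at, post_id) pairs ordered oldest first
    (a hypothetical layout chosen for this sketch).
    """
    cutoff = now - retention_s
    removed = 0
    # Because the queue is ordered, stop at the first post still in range.
    while posts and posts[0][0] < cutoff:
        posts.popleft()
        removed += 1
    return removed
```

If the number of retained posts keeps growing across trim cycles even though ingestion volume is flat, that suggests the trim is not keeping up and memory usage deserves a closer look.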
5. Phoenix Model Inference Health
For the ranking stage, focus on the transformer's attention and embedding health.
- Attention Masking: Verify that candidate-to-candidate attention is blocked to prevent information leakage during scoring. You can validate this by checking the `make_recsys_attn_mask` output in unit tests.
- Embedding Scale: If you are debugging ranking quality, monitor the `CandidateTower` outputs. The model uses L2-normalization for its representations, so the norms of your candidate embeddings should stay close to `1.0`.
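Both checks can be expressed as small assertions. The sketch below does not call `make_recsys_attn_mask` or `CandidateTower`, whose signatures are not shown here; it assumes the mask is available as a square boolean matrix where `mask[q][k]` means query position `q` may attend to key position `k`, that candidates occupy the trailing positions, and that each candidate may attend to itself. All function names are hypothetical.

```python
import math
from typing import Sequence


def candidates_are_isolated(mask: Sequence[Sequence[bool]], num_context: int) -> bool:
    """True if no candidate position attends to a different candidate.

    Positions 0..num_context-1 are user/history context (assumed layout);
    positions num_context.. are candidates.
    """
    n = len(mask)
    for q in range(num_context, n):
        for k in range(num_context, n):
            if q != k and mask[q][k]:
                return False
    return True


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def is_unit_norm(vec: Sequence[float], tol: float = 1e-3) -> bool:
    """True if the vector's L2 norm is within tol of 1.0."""
    return abs(l2_norm(vec) - 1.0) <= tol
```

In a unit test, the mask produced by the model code would be converted to nested lists and passed to `candidates_are_isolated`, while `is_unit_norm` would be applied to each candidate embedding row.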
Summary Checklist for Health Dashboards
| Metric Category | Key Metric Name | Alert Condition |
| :--- | :--- | :--- |
| Availability | REJECTED_REQUESTS | > 0 for 5 minutes |
| Performance | GET_IN_NETWORK_POSTS_DURATION | P99 > 200ms |
| Data Flow | KAFKA_MESSAGES_FAILED_PARSE | > 1% of traffic |
| Relevance | GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS | Mean > 3600s (1 hour) |
| Variety | GET_IN_NETWORK_POSTS_FOUND_UNIQUE_AUTHORS | Significant drop from baseline |
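The performance row depends on a P99 value, which is a percentile rather than an average. As a worked illustration, the sketch below computes a nearest-rank percentile over raw latency samples and compares it with the 200 ms threshold from the table. The function names are hypothetical. In a Prometheus setup this is normally done in the query layer with `histogram_quantile`, assuming the duration metric is exported as a histogram.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples is empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def p99_latency_alert(samples_ms: Sequence[float], threshold_ms: float = 200.0) -> bool:
    """True if the P99 latency exceeds the threshold (200 ms in the table)."""
    return percentile(samples_ms, 99) > threshold_ms
```

With 100 samples, the P99 is the 99th smallest value, so a single slow outlier does not fire the alert but two do.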