Monitoring Pipeline Health

Ensuring the stability and performance of the For You feed requires active monitoring of the retrieval and ranking pipelines. The x-algorithm system exposes a comprehensive suite of metrics via Prometheus-compatible endpoints.

This guide walks you through accessing these metrics and interpreting key indicators to maintain a healthy recommendation ecosystem.

1. Accessing the Metrics Endpoint

The Home Mixer and Thunder services each expose a dedicated metrics port, which is set on the command line at startup.

Locate the metrics port: Look for the --metrics-port flag in your service configuration or startup logs.

```
# Example: Starting Home Mixer with metrics on port 9090
./home-mixer --grpc-port 50051 --metrics-port 9090
```

Query the endpoint: Use curl, or point your Prometheus instance at the /metrics path.
```
curl http://localhost:9090/metrics
```
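
For ad hoc checks without a Prometheus server, the text returned by /metrics can be parsed directly. The sketch below is a minimal parser for the Prometheus text exposition format using only the Python standard library. The metric names in `SAMPLE` are hypothetical placeholders; the names your services actually export may differ from the identifiers used in this guide.

```python
from urllib.request import urlopen


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. The series
    key keeps its label set, e.g. 'requests_total{route="/feed"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            idx = line.rfind("}")
            series, rest = line[: idx + 1], line[idx + 1 :].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed line: no value
        try:
            samples[series] = float(rest[0])  # an optional timestamp may follow
        except ValueError:
            continue  # skip values that are not numeric
    return samples


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5.0):
    """Fetch and parse a metrics endpoint (the default URL is an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Hypothetical payload, used only to illustrate the format.
SAMPLE = """\
# HELP in_flight_requests Requests currently being processed
# TYPE in_flight_requests gauge
in_flight_requests 12
rejected_requests_total{service="home-mixer"} 3
"""
```

Calling `parse_prometheus_text(SAMPLE)` returns a dictionary keyed by series name, which is convenient for quick assertions in smoke tests.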

2. Monitoring System Load and Rejections

The system uses semaphores to limit concurrent requests and prevent cascading failures. Monitoring these metrics is the first step in identifying bottlenecks.

REJECTED_REQUESTS: Increments each time a request is turned away because the concurrency limit has been reached. If this counter rises steadily, consider raising --max-concurrent-requests or adding instances.
IN_FLIGHT_REQUESTS: Shows the number of requests currently being processed. Compare it with the configured concurrency limit and your hardware capacity to track real-time utilization.
GET_IN_NETWORK_POSTS_DURATION: Measures the total latency of the In-Network (Thunder) retrieval stage. Rising latency here tends to increase the number of in-flight requests and, eventually, rejections.
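
To make the relationship between the concurrency limit and these two metrics concrete, here is a minimal sketch of semaphore-based admission control. `RequestGate` and its attributes are hypothetical names chosen for illustration; this is not the services' actual implementation, only a model of the behavior described above.

```python
import threading


class RequestGate:
    """Admit at most `max_concurrent` requests; reject the rest immediately."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.in_flight = 0  # gauge: requests currently being processed
        self.rejected = 0   # counter: requests turned away at capacity

    def try_enter(self):
        """Return True if the request was admitted, False if rejected."""
        # Non-blocking acquire: fail fast instead of queueing the request.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            return False
        with self._lock:
            self.in_flight += 1
        return True

    def leave(self):
        """Release the slot held by a finished request."""
        with self._lock:
            self.in_flight -= 1
        self._slots.release()
```

With a limit of 2, a third concurrent call to `try_enter()` is rejected and increments `rejected`, while `in_flight` stays at 2 until one of the admitted requests calls `leave()`.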

3. Tracking Candidate Freshness and Diversity

Because the "For You" feed prioritizes real-time relevance, you must monitor the "freshness" of the candidates being retrieved from the PostStore.

Freshness Metrics

Use the following metrics to ensure the pipeline isn't serving stale content:

GET_IN_NETWORK_POSTS_FOUND_FRESHNESS_SECONDS: Tracks the time elapsed since the most recent post in a retrieval batch was created. Persistently high values across many requests suggest lag in the Kafka ingestion pipeline; a high value on a single request may only mean that the followed accounts have not posted recently.
GET_IN_NETWORK_POSTS_FOUND_TIME_RANGE_SECONDS: The time span between the oldest and newest post in a batch. This helps you understand the temporal depth of the candidate set.
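
The two freshness values can be derived from the creation timestamps of one retrieval batch. The following is a small sketch under the assumption that timestamps are Unix seconds; `freshness_stats` is a hypothetical helper, not a function from the codebase.

```python
def freshness_stats(created_at_seconds, now_seconds):
    """Return (freshness_seconds, time_range_seconds) for one batch.

    created_at_seconds: creation times of the retrieved posts (Unix seconds).
    Returns None for an empty batch, where neither value is defined.
    """
    if not created_at_seconds:
        return None
    newest = max(created_at_seconds)
    oldest = min(created_at_seconds)
    freshness = now_seconds - newest  # age of the most recent post
    time_range = newest - oldest      # temporal depth of the batch
    return freshness, time_range
```

For example, a batch created at seconds 100, 160, and 190 and evaluated at second 200 has a freshness of 10 seconds and a time range of 90 seconds.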

Diversity and Quality Metrics

GET_IN_NETWORK_POSTS_FOUND_UNIQUE_AUTHORS: Tracks how many distinct authors are represented in the candidate set. A drop here might indicate a problem with "Follow" graph hydration.
GET_IN_NETWORK_POSTS_FOUND_REPLY_RATIO: Monitors the balance between original posts and replies. Watch it to make sure the feed does not become saturated with conversational threads.
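
As an illustration of what these two values summarize, the sketch below computes them for a batch of candidates. The `Post` fields are hypothetical, and the reply ratio is assumed here to mean replies divided by all posts in the batch; the service may define it differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    author_id: int
    # None marks an original post; any other value marks a reply (assumption).
    in_reply_to_post_id: Optional[int] = None


def diversity_stats(posts):
    """Return (unique_authors, reply_ratio) for one candidate batch."""
    if not posts:
        return 0, 0.0
    unique_authors = len({p.author_id for p in posts})
    replies = sum(1 for p in posts if p.in_reply_to_post_id is not None)
    return unique_authors, replies / len(posts)
```

A batch of four posts from three authors, two of which are replies, yields 3 unique authors and a reply ratio of 0.5.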

4. Observing Pipeline Storage Health

The Thunder service manages an in-memory PostStore with an automated cleanup task. Monitoring the "trimming" process helps confirm that memory is not growing without bound while enough history is retained for the Phoenix model.

Check Retention Logs: When the auto-trim task starts, the system logs its interval and retention settings. By default, the task runs every 2 minutes.
```
Started PostStore auto-trim task (interval: 2 minutes, retention: 2.0 days)
```
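
Retention-based trimming can be pictured with a toy store that drops posts older than the retention window each time the task runs. `TinyPostStore` is a hypothetical stand-in and assumes posts arrive in roughly chronological order; it does not reflect how the real PostStore is organized.

```python
from collections import deque


class TinyPostStore:
    """Toy in-memory store that keeps posts for `retention_seconds`."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._posts = deque()  # (created_at, post_id), oldest first

    def add(self, created_at, post_id):
        # Assumes roughly chronological arrival, so the oldest post is on the left.
        self._posts.append((created_at, post_id))

    def trim(self, now):
        """Drop posts older than the retention window; return how many were removed."""
        cutoff = now - self.retention_seconds
        removed = 0
        while self._posts and self._posts[0][0] < cutoff:
            self._posts.popleft()
            removed += 1
        return removed

    def __len__(self):
        return len(self._posts)
```

Calling `trim(now)` on a timer mirrors the interval in the log line, and a return value that stays at zero while the store keeps growing would be the pattern to investigate.
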
Monitor Batch Processing:
- KAFKA_MESSAGES_FAILED_PARSE: If this counter increases, it indicates a schema mismatch between your producers and the algorithm's deserializers.
- BATCH_PROCESSING_TIME: Tracks how long it takes to ingest and index posts from Kafka into the in-memory store.
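
The two batch metrics above correspond to a simple pattern: count the messages that fail to deserialize and time the whole batch. The sketch below uses JSON and a plain dictionary purely as stand-ins; the actual wire format, field names, and store are assumptions and will differ in the real service.

```python
import json
import time


def process_batch(raw_messages, store):
    """Ingest one batch; return (failed_parse_count, elapsed_seconds).

    raw_messages: iterable of JSON strings (hypothetical wire format).
    store: dict mapping post_id to author_id (hypothetical store).
    """
    failed = 0
    started = time.perf_counter()
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            store[event["post_id"]] = event["author_id"]
        except (ValueError, KeyError, TypeError):
            # Invalid JSON, missing fields, or an unexpected top-level type.
            failed += 1
    return failed, time.perf_counter() - started
```

In a service, `failed` would feed a counter and the elapsed time would feed a histogram, so that a schema change shows up as a step in the failure rate rather than as silently missing posts.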

5. Phoenix Model Inference Health

For the ranking stage, focus on the transformer's attention and embedding health.

Attention Masking: Verify that candidate-to-candidate attention is blocked to prevent information leakage during scoring. You can validate this by checking the make_recsys_attn_mask output in unit tests.
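Embedding Scale: If you are debugging ranking quality, monitor the CandidateTower outputs. The model L2-normalizes its representations, so the norm of every candidate embedding should be 1.0 within floating-point tolerance. Norms that drift away from 1.0 suggest that normalization is being skipped or that the wrong tensor is being inspected.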
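
Both checks can be expressed as small assertions. The sketch below builds a hypothetical mask in which context positions attend causally and each candidate attends to the full context and to itself only, then verifies candidate isolation and unit norms. The mask layout and all function names are assumptions for illustration; this is not the real make_recsys_attn_mask, whose signature and output format should be taken from the Phoenix source.

```python
import math


def build_candidate_isolation_mask(num_context, num_candidates):
    """Hypothetical mask: mask[q][k] is True when query q may attend to key k."""
    n = num_context + num_candidates
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_context:
                # Causal over context; candidate queries see all context keys.
                mask[q][k] = k <= q
            else:
                # A candidate key is visible only to the candidate itself.
                mask[q][k] = q == k
    return mask


def candidates_are_isolated(mask, num_context):
    """True if no candidate query attends to a different candidate key."""
    n = len(mask)
    return all(
        not mask[q][k]
        for q in range(num_context, n)
        for k in range(num_context, n)
        if q != k
    )


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def all_unit_norm(embeddings, tol=1e-3):
    """True if every embedding has an L2 norm within `tol` of 1.0."""
    return all(abs(l2_norm(v) - 1.0) <= tol for v in embeddings)
```

In a unit test, the same `candidates_are_isolated` check could be applied to the boolean form of the real mask, and `all_unit_norm` to a sample of exported candidate embeddings.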

Summary Checklist for Health Dashboards