Profiling System Performance
Maintaining a low-latency "For You" feed requires identifying bottlenecks in the retrieval and ranking pipeline. The x-algorithm codebase provides built-in support for profiling both the Rust-based infrastructure (Thunder, Home Mixer) and the Python-based transformer models (Phoenix).
This guide walks you through enabling the profiling servers and capturing performance traces.
Enabling the Profiling Server
The Rust services include an optional profiling server that can be toggled via CLI arguments. When enabled, it provides an endpoint for CPU and memory profiling.
1. Start Thunder with Profiling
To profile the in-network retrieval service (Thunder), use the `--enable-profiling` flag. By default, this spawns a profiling server on port 3000.
```shell
./target/release/thunder \
  --grpc-port 5001 \
  --http-port 8080 \
  --enable-profiling \
  --is-serving
```
2. Configure Home Mixer Metrics
Home Mixer uses a dedicated metrics port to export telemetry data that can be consumed by Prometheus or visualized via Grafana.
```shell
./target/release/home-mixer \
  --grpc-port 5002 \
  --metrics-port 9090 \
  --reload-interval-minutes 60
```
Profiling the Rust Pipeline
Once the services are running with profiling enabled, you can use standard tools like `pprof` or the internal `xai_profiling` interface to capture data.
Capturing a CPU Profile
While the system is under load (e.g., during a load test or a high-traffic period), capture a 30-second CPU profile:
- Identify the endpoint: the profiling server usually listens at `http://localhost:3000/debug/pprof/profile`.
- Generate the trace:

  ```shell
  # Capture a 30-second profile and save it to a file
  curl -o cpu_profile.pb.gz "http://localhost:3000/debug/pprof/profile?seconds=30"
  ```

- Visualize: use `pprof` to generate a flame graph:

  ```shell
  pprof -http=:8081 cpu_profile.pb.gz
  ```
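The capture step is also easy to script, for example to grab profiles at regular intervals during a load test. A minimal sketch, assuming only the endpoint above; the helper name is ours:

```python
from urllib.parse import urlencode

def pprof_capture_url(host: str, port: int, seconds: int) -> str:
    """Build the pprof CPU-profile capture URL for a given duration."""
    query = urlencode({"seconds": seconds})
    return f"http://{host}:{port}/debug/pprof/profile?{query}"

url = pprof_capture_url("localhost", 3000, 30)
print(url)  # http://localhost:3000/debug/pprof/profile?seconds=30

# Download the profile (requires the service to be running with --enable-profiling):
# import urllib.request
# urllib.request.urlretrieve(url, "cpu_profile.pb.gz")
```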
Identifying Memory Bottlenecks
In Thunder, the `PostStore` maintains a large in-memory cache of recent posts. If you notice high memory usage, check the `PostStore` stats logger, which is started automatically in serving mode:
- Look for "PostStore stats" in the application logs.
- Monitor the `GET_IN_NETWORK_POSTS_EXCLUDED_SIZE` metric to see whether too many candidates are being filtered out early, which indicates inefficient retrieval.
Profiling Phoenix (JAX/Python)
The Phoenix ranking and retrieval models are built using JAX and Haiku. Performance issues here usually stem from XLA compilation or GPU/TPU utilization.
1. Using the JAX Profiler
To profile the `RecsysModel` during inference, use the `jax.profiler` API. This lets you see the execution time of specific transformer layers.
```python
import jax

# Option 1: start the profiler server so a client (e.g. TensorBoard)
# can connect and capture a trace on demand
jax.profiler.start_server(port=9999)

# Option 2: annotate a specific execution block so it appears
# as a named region in the captured trace
with jax.profiler.TraceAnnotation("inference_step"):
    output = model.apply(params, batch_embeddings)
```
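One caveat when timing JAX code by hand: dispatch is asynchronous, so wall-clock timing around `model.apply` alone measures dispatch, not device execution. A minimal helper that waits for the result before stopping the clock (illustrative; it works with any result exposing `block_until_ready`):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn, forcing JAX async dispatch to complete before stopping the clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    # JAX arrays expose block_until_ready(); wait for the device computation
    # to finish so the measurement covers execution, not just dispatch.
    if hasattr(result, "block_until_ready"):
        result.block_until_ready()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage (model, params, batch_embeddings as in the snippet above):
# output, seconds = timed_call(model.apply, params, batch_embeddings)
```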
2. Analyzing XLA Compilation
If the first request to the "For You" feed is extremely slow, it is likely due to XLA compilation. To debug this:
- Set the environment variable `XLA_FLAGS="--xla_dump_to=/tmp/xla_dump"`.
- Run the model.
- Analyze the generated HLO files to identify expensive operations or fusion failures in the transformer block.
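A dump directory can contain hundreds of files, so as a rough triage it helps to sort them by size and inspect the heavyweight modules first. File naming inside the dump is XLA-internal; this sketch only looks at sizes:

```python
import os

def largest_dump_files(dump_dir: str, top_n: int = 5) -> list[tuple[str, int]]:
    """Return the top_n largest files in an XLA dump directory as (path, bytes)."""
    sizes = []
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            path = os.path.join(root, name)
            sizes.append((path, os.path.getsize(path)))
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:top_n]

# Usage after a run with XLA_FLAGS="--xla_dump_to=/tmp/xla_dump":
# for path, size in largest_dump_files("/tmp/xla_dump"):
#     print(f"{size:>10} {path}")
```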
Monitoring Real-time Metrics
Both Thunder and Home Mixer export high-level metrics that should be monitored during profiling to correlate performance spikes with specific system behaviors.
| Metric | Description | Service |
| :--- | :--- | :--- |
| `GET_IN_NETWORK_POSTS_DURATION` | Total latency for Thunder retrieval. | Thunder |
| `IN_FLIGHT_REQUESTS` | Number of concurrent requests being processed. | Thunder |
| `BATCH_PROCESSING_TIME` | Time spent deserializing Kafka tweet events. | Thunder |
| `REJECTED_REQUESTS` | Requests dropped due to the semaphore limit. | Thunder |
Adjusting Concurrency Limits
If you see a high number of `REJECTED_REQUESTS` despite low CPU usage, you may need to increase the `--max-concurrent-requests` argument in Thunder. This setting controls the semaphore that prevents service exhaustion during retrieval spikes.
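The load-shedding mechanism itself can be illustrated with a small sketch. This is not Thunder's actual Rust implementation, just the semaphore-based admission pattern expressed in Python:

```python
import asyncio

class AdmissionControl:
    """Reject requests beyond a concurrency cap (illustrative sketch of the
    mechanism behind --max-concurrent-requests, not the real implementation)."""

    def __init__(self, max_concurrent: int):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.rejected = 0  # analogous to a REJECTED_REQUESTS counter

    async def handle(self, work):
        if self._semaphore.locked():
            # All permits in use: shed load instead of queueing.
            self.rejected += 1
            return None
        async with self._semaphore:
            return await work()

async def demo():
    ac = AdmissionControl(max_concurrent=2)

    async def slow_request():
        await asyncio.sleep(0.05)
        return "ok"

    results = await asyncio.gather(*(ac.handle(slow_request) for _ in range(5)))
    return results, ac.rejected

results, rejected = asyncio.run(demo())
print(rejected)  # 3 of the 5 concurrent requests were shed
```

Rejecting up front rather than queueing keeps tail latency bounded: callers get a fast failure they can retry instead of waiting behind a growing backlog.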