The Candidate Pipeline Lifecycle

The Candidate Pipeline is the heart of the "For You" feed, responsible for transforming a massive global corpus of posts into a personalized, ranked list for a specific user. This guide walks you through the lifecycle of a post as it moves from raw data to a ranked candidate in your feed.

The process is orchestrated by the Home Mixer and follows four phases: Retrieval, Hydration, Scoring, and Selection (which includes filtering).

Step 1: Retrieval (Finding Potential Candidates)

The lifecycle begins with gathering a "pool" of potential posts from two distinct sources. This reduces the search space from millions of posts down to a few thousand.

In-Network (Thunder): The system queries the Thunder service to retrieve recent posts from accounts the user follows. It uses the GetInNetworkPosts gRPC call to fetch these "LightPosts."
Out-of-Network (Phoenix Retrieval): This identifies relevant posts from people the user doesn't follow. It uses a Two-Tower Model:
- User Tower: Encodes the user's history and actions into a normalized vector.
- Candidate Tower: Projects global posts and their authors into the same embedding space.
- ANN Search: The system performs an Approximate Nearest Neighbor search (dot product similarity) to find posts that align with the user's representation.

# Conceptual Retrieval Logic
user_representation = user_tower(user_history_hashes)
top_k_posts = ann_index.query(user_representation, k=1000)
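
To make the retrieval step concrete, here is a minimal sketch that replaces the ANN index with an exact brute-force search over dot products. It is an illustration only: the random vectors stand in for the outputs of the two towers, and the helper names (l2_normalize, exact_top_k) are hypothetical, not part of the actual system.

```python
import heapq
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (the user tower output is described as normalized)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(user_vec, candidates, k):
    """Brute-force stand-in for an ANN index: score every candidate, keep the k best."""
    scored = ((dot(user_vec, vec), post_id) for post_id, vec in candidates.items())
    return heapq.nlargest(k, scored)  # list of (score, post_id), highest score first


# Random embeddings stand in for real tower outputs (assumption for illustration).
random.seed(0)
DIM = 8
candidates = {
    post_id: l2_normalize([random.gauss(0, 1) for _ in range(DIM)])
    for post_id in range(500)
}
user_representation = l2_normalize([random.gauss(0, 1) for _ in range(DIM)])

top_k_posts = exact_top_k(user_representation, candidates, k=10)
```

A real ANN index trades a small amount of recall for speed by avoiding the full scan that exact_top_k performs; the ranking criterion (highest dot product first) is the same.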

Step 2: Hydration (Enriching the Candidates)

Raw post IDs aren't enough for high-precision ranking. The Hydration phase enriches each candidate with the metadata required by the Grok-based transformer.

Feature Gathering: The system fetches author information, post content metadata, and the user's recent action sequence (likes, replies, shares).
Hashing: To eliminate hand-engineered features, users, posts, and authors are represented as hashes. A RecsysBatch object is then constructed containing:
- user_hashes: Unique identifiers for the user.
- history_post_hashes / history_author_hashes: The last N posts the user interacted with.
- candidate_post_hashes / candidate_author_hashes: The hashes for the post currently being evaluated.
- history_actions: The specific types of engagement (like, reply, etc.) associated with the history.
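
The fields listed above could be assembled roughly as follows. This is a simplified sketch: the field names come from the list, but the container type, the hashing scheme (stable_hash), the bucket count, the action encoding, and the build_batch helper are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple

NUM_BUCKETS = 1_000_003  # hypothetical hash-space size
ACTION_IDS = {"like": 0, "reply": 1, "share": 2}  # hypothetical action encoding


def stable_hash(namespace: str, entity_id: int) -> int:
    """Deterministically map an ID to a bucket (illustrative scheme only)."""
    digest = hashlib.blake2b(f"{namespace}:{entity_id}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BUCKETS


@dataclass
class RecsysBatch:
    user_hashes: List[int]
    history_post_hashes: List[int]
    history_author_hashes: List[int]
    history_actions: List[int]
    candidate_post_hashes: List[int]
    candidate_author_hashes: List[int]


def build_batch(
    user_id: int,
    history: List[Tuple[int, int, str]],  # (post_id, author_id, action)
    candidates: List[Tuple[int, int]],    # (post_id, author_id)
    max_history: int,
) -> RecsysBatch:
    # Keep only the last N interactions.
    recent = history[-max_history:] if max_history > 0 else []
    return RecsysBatch(
        user_hashes=[stable_hash("user", user_id)],
        history_post_hashes=[stable_hash("post", p) for p, _, _ in recent],
        history_author_hashes=[stable_hash("author", a) for _, a, _ in recent],
        history_actions=[ACTION_IDS[act] for _, _, act in recent],
        candidate_post_hashes=[stable_hash("post", p) for p, _ in candidates],
        candidate_author_hashes=[stable_hash("author", a) for _, a in candidates],
    )


history = [(101, 7, "like"), (102, 8, "reply"), (103, 7, "share")]
candidates = [(201, 9), (202, 7)]
batch = build_batch(user_id=42, history=history, candidates=candidates, max_history=2)
```

Because the same hash function is used for history and candidates, an author who appears in both (author 7 here) maps to the same value, which is what lets the model relate a candidate to the user's past interactions.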

Step 3: Scoring (Predicting Engagement)

Once hydrated, candidates are passed to Phoenix, the Grok-based ranking model. This stage predicts the probability of the user engaging with each post.

Embedding Lookup: Hashes are converted into dense embeddings using RecsysEmbeddings.
Transformer Processing: The model uses a Grok-1 based transformer architecture. Unlike a standard language model, which applies a purely causal mask to every position, it uses a specialized attention mask:
- User/History positions use causal attention (attending only to the past).
- Candidate positions attend to the entire user history to understand context, but do not attend to other candidates in the same batch.
Logit Generation: The model outputs one logit per engagement type (e.g., Like, Retweet). Each logit is the unnormalized score from which the probability of that engagement is derived.

# The scoring model combines history and candidate data
output = phoenix_ranker(
    embeddings=recsys_embeddings,
    batch=recsys_batch,
    mask=recsys_attn_mask
)
# Final score = weighted sum over the per-engagement outputs (weights are applied in Step 4)
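
The attention rules described in this step can be sketched as a boolean mask. The sketch assumes a sequence layout in which the context positions (user and history) come first and the candidates follow, and it assumes each candidate may attend to itself; neither detail is stated above. The function name build_recsys_attn_mask is hypothetical.

```python
from typing import List


def build_recsys_attn_mask(num_context: int, num_candidates: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Positions 0..num_context-1 are user/history; the rest are candidates.
    """
    total = num_context + num_candidates
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_context:
                # Context keys are visible causally. Every candidate sits after
                # the context, so candidates see the entire history.
                mask[q][k] = k <= q
            else:
                # Candidate keys are visible only to themselves, so candidates
                # in the same batch never influence each other.
                mask[q][k] = k == q
    return mask


mask = build_recsys_attn_mask(num_context=3, num_candidates=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Isolating candidates this way means a candidate's score does not depend on which other candidates happen to share its batch.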

Step 4: Filtering and Selection

The final phase transforms scores into the actual feed the user sees.

Scoring & Weighting: A Scorer applies business logic weights to the model's raw engagement probabilities. For example, a "Reply" might be weighted more heavily than a "Like."
Filtering: The Filter module removes content that shouldn't be shown, such as:
- Duplicate posts or near-duplicate media.
- Blocked or muted content.
- Content already seen by the user in recent sessions.
Selection: The Selector takes the top N filtered results and prepares the final ScoredPostsResponse to be sent to the client.

Summary of the Flow