Integrating Custom Embeddings
The Phoenix ranker uses a sophisticated multi-hash embedding system to represent users, posts, and authors. Instead of using raw IDs or hand-engineered features, the model maps entities into a shared vector space using multiple hash functions.
This guide walks you through configuring the hashing logic and integrating custom embedding lookups into the Phoenix pipeline.
Overview of Hash-Based Embeddings
Phoenix does not use a single embedding table for an ID. Instead, it uses multiple hash functions to index into smaller tables, which are then reduced into a final representation. This approach reduces collisions and improves the model's ability to generalize across a massive ID space.
The process follows three main steps:
- Hashing: Mapping an ID (User, Post, or Author) to $N$ hash values.
- Lookup: Retrieving vectors from an embedding table for each hash value.
- Reduction: Combining these vectors into a single representation using a learnable projection (e.g., `block_user_reduce`).
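The three steps can be sketched end to end with a toy NumPy example. This is an illustration only, not Phoenix's actual hashing or reduction code; `multi_hash`, the table size, and the projection matrix are all assumptions:

```python
import numpy as np

def multi_hash(entity_id: int, num_hashes: int, table_size: int) -> list:
    # Toy hash family: salt the ID with the hash index so each of the
    # N hash functions produces a different slot (illustrative only).
    return [hash((entity_id, i)) % table_size for i in range(num_hashes)]

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 8))   # small shared embedding table: [table_size, D]
proj = rng.normal(size=(2 * 8, 8))   # learnable projection reducing N vectors back to D

hashes = multi_hash(42, num_hashes=2, table_size=1000)  # step 1: hashing
vectors = table[hashes]                                 # step 2: lookup -> [2, 8]
final_repr = vectors.reshape(-1) @ proj                 # step 3: reduction -> [8]
```

Because several IDs share each table slot, the model relies on the combination of N hash lookups, rather than any single slot, to disambiguate entities.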
Step 1: Configure Hashing Parameters
First, define how many hash functions the model should expect for each entity type. This is managed via the HashConfig class.
```python
from phoenix.recsys_model import HashConfig

# Define how many hash functions to use per entity
hash_config = HashConfig(
    num_user_hashes=2,
    num_item_hashes=2,
    num_author_hashes=2,
)
```
Note: The hash value `0` is reserved for padding or invalid entities. Ensure your hashing logic accounts for this offset.
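One way to honor the reserved-zero convention is to shift real hashes up by one and route missing entities to slot 0. The helper below is a hypothetical sketch, not a Phoenix API:

```python
PAD_HASH = 0  # reserved for padding / invalid entities

def safe_hash(entity_id, hash_index, table_size):
    # Map missing entities to the reserved padding slot, and shift real
    # hashes into [1, table_size - 1] so they never collide with 0.
    if entity_id is None:
        return PAD_HASH
    return 1 + (hash((entity_id, hash_index)) % (table_size - 1))
```

Without an offset like this, a legitimate entity that happens to hash to 0 would be silently treated as padding downstream.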
Step 2: Prepare the Embedding Container
Before passing data to the transformer, you must look up the raw vectors from your storage layer (e.g., a key-value store or a JAX array). These are then wrapped in the RecsysEmbeddings dataclass.
The shape for these arrays should generally be [BatchSize, NumHashes, EmbeddingDimension].
```python
import jax.numpy as jnp

from phoenix.recsys_model import RecsysEmbeddings

# Example: Prepare looked-up embeddings for a batch of 1
# D = Embedding Dimension (e.g., 64)
embeddings = RecsysEmbeddings(
    user_embeddings=jnp.ones((1, 2, 64)),                  # [B, num_user_hashes, D]
    history_post_embeddings=jnp.ones((1, 128, 2, 64)),     # [B, SeqLen, num_item_hashes, D]
    candidate_post_embeddings=jnp.ones((1, 32, 2, 64)),    # [B, CandLen, num_item_hashes, D]
    history_author_embeddings=jnp.ones((1, 128, 2, 64)),   # [B, SeqLen, num_author_hashes, D]
    candidate_author_embeddings=jnp.ones((1, 32, 2, 64)),  # [B, CandLen, num_author_hashes, D]
)
```
Step 3: Implement the Reduction Logic
Phoenix provides built-in utilities to "flatten" these multiple hash embeddings into a single vector that the transformer can process. You will typically use block_user_reduce and block_history_reduce inside your model's forward pass.
Reducing User Embeddings
The `block_user_reduce` function concatenates the per-hash embeddings and projects them back to the model's standard embedding size.
```python
from phoenix.recsys_model import block_user_reduce

# user_hashes:     [B, num_user_hashes]
# user_embeddings: [B, num_user_hashes, D]
user_repr, padding_mask = block_user_reduce(
    user_hashes=batch.user_hashes,
    user_embeddings=embeddings.user_embeddings,
    num_user_hashes=2,
    emb_size=64,
)
```
Reducing History Sequences
For user history (the sequence of posts a user has interacted with), block_history_reduce combines post, author, action, and surface embeddings into a single sequence representation.
```python
from phoenix.recsys_model import block_history_reduce

history_sequence, history_mask = block_history_reduce(
    history_post_hashes=batch.history_post_hashes,
    history_post_embeddings=embeddings.history_post_embeddings,
    history_author_embeddings=embeddings.history_author_embeddings,
    history_product_surface_embeddings=surface_embs,
    history_actions_embeddings=action_embs,
    num_item_hashes=2,
    num_author_hashes=2,
)
```
Step 4: Integrating with Candidate Towers
For retrieval tasks (Phoenix Retrieval), candidate posts and authors must be projected into a shared normalized space using the CandidateTower. This allows for efficient dot-product similarity searches.
```python
from phoenix.recsys_retrieval_model import CandidateTower

# Initialize the tower
candidate_tower = CandidateTower(emb_size=64)

# Project concatenated post and author embeddings
# Shape: [Batch, NumCandidates, (NumPostHashes + NumAuthorHashes) * D]
candidate_repr = candidate_tower(concatenated_embeddings)
```
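Once user and candidate representations live in the same normalized space, retrieval scoring reduces to a dot product. A minimal NumPy sketch with stand-in arrays (the shapes mirror the example above; the names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
user_repr = rng.normal(size=(1, 64))           # [B, D] from the user side
candidate_repr = rng.normal(size=(1, 32, 64))  # [B, NumCandidates, D] from the tower

# Dot-product similarity between each user and each of its candidates
scores = np.einsum('bd,bcd->bc', user_repr, candidate_repr)  # [B, NumCandidates]

# Indices of the five highest-scoring candidates per user
top_k = np.argsort(-scores, axis=-1)[:, :5]
```

In production this exhaustive scoring is replaced by an ANN index over the candidate vectors, which is why the shared normalized space matters.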
Best Practices
- Initialization Scale: When using `block_user_reduce`, you can adjust `embed_init_scale`. Higher scales can speed up initial convergence but may lead to instability if set too high.
- Normalization: The `CandidateTower` automatically applies L2 normalization to its output. If you implement a custom retrieval tower, normalize your embeddings to maintain compatibility with the ANN (Approximate Nearest Neighbor) indices used in the "For You" pipeline.
- Action Embeddings: Always include the user's action (like, reply, etc.) when building history embeddings. The Grok transformer relies heavily on the relationship between the post content and the user's specific engagement type.
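For a custom retrieval tower, the L2 normalization mentioned above can be sketched as follows (a minimal NumPy version; the epsilon guard is an assumption, added to avoid division by zero on padded rows):

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-6) -> np.ndarray:
    # Scale vectors to unit length so dot products become cosine
    # similarities, which is what an ANN index over unit vectors expects.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

candidates = np.random.default_rng(0).normal(size=(4, 64))
unit_candidates = l2_normalize(candidates)  # each row now has norm 1
```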