Integrating Custom Embeddings
The Phoenix ranker uses a sophisticated multi-hash embedding system to represent users, posts, and authors. Instead of using raw IDs or hand-engineered features, the model maps entities into a shared vector space using multiple hash functions.
This guide walks you through configuring the hashing logic and integrating custom embedding lookups into the Phoenix pipeline.
Overview of Hash-Based Embeddings
Phoenix does not use a single embedding table for an ID. Instead, it uses multiple hash functions to index into smaller tables, which are then reduced into a final representation. This approach reduces collisions and improves the model's ability to generalize across a massive ID space.
The process follows three main steps:
- Hashing: Mapping an ID (User, Post, or Author) to $N$ hash values.
- Lookup: Retrieving vectors from an embedding table for each hash value.
- Reduction: Combining these vectors into a single representation using a learnable projection (e.g., `block_user_reduce`).
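The three steps can be sketched end to end with a toy NumPy example. This is an illustration only, not Phoenix's actual hashing or reduction code; `multi_hash`, the table size, and the projection matrix are all assumptions:

```python
import numpy as np

def multi_hash(entity_id: int, num_hashes: int, table_size: int) -> list:
    # Toy hash family: salt the ID with the hash index so each of the
    # N hash functions produces a different slot (illustrative only).
    return [hash((entity_id, i)) % table_size for i in range(num_hashes)]

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 8))   # small shared embedding table: [table_size, D]
proj = rng.normal(size=(2 * 8, 8))   # learnable projection reducing N vectors back to D

hashes = multi_hash(42, num_hashes=2, table_size=1000)  # step 1: hashing
vectors = table[hashes]                                 # step 2: lookup -> [2, 8]
final_repr = vectors.reshape(-1) @ proj                 # step 3: reduction -> [8]
```

Because several IDs share each table slot, the model relies on the combination of N hash lookups, rather than any single slot, to disambiguate entities.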
Step 1: Configure Hashing Parameters
First, define how many hash functions the model should expect for each entity type. This is managed via the HashConfig class.
```python
from phoenix.recsys_model import HashConfig

# Define how many hash functions to use per entity
hash_config = HashConfig(
    num_user_hashes=2,
    num_item_hashes=2,
    num_author_hashes=2,
)
```
Note: The hash value `0` is reserved for padding or invalid entities. Ensure your hashing logic accounts for this offset.
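One way to honor the reserved-zero convention is to shift real hashes up by one and route missing entities to slot 0. The helper below is a hypothetical sketch, not a Phoenix API:

```python
PAD_HASH = 0  # reserved for padding / invalid entities

def safe_hash(entity_id, hash_index, table_size):
    # Map missing entities to the reserved padding slot, and shift real
    # hashes into [1, table_size - 1] so they never collide with 0.
    if entity_id is None:
        return PAD_HASH
    return 1 + (hash((entity_id, hash_index)) % (table_size - 1))
```

Without an offset like this, a legitimate entity that happens to hash to 0 would be silently treated as padding downstream.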
Step 2: Prepare the Embedding Container
Before passing data to the transformer, you must look up the raw vectors from your storage layer (e.g., a key-value store or a JAX array). These are then wrapped in the RecsysEmbeddings dataclass.
The shape for these arrays should generally be [BatchSize, NumHashes, EmbeddingDimension].
```python
import jax.numpy as jnp

from phoenix.recsys_model import RecsysEmbeddings

# Example: Prepare looked-up embeddings for a batch of 1
# D = Embedding Dimension (e.g., 64)
embeddings = RecsysEmbeddings(
    user_embeddings=jnp.ones((1, 2, 64)),                  # [B, num_user_hashes, D]
    history_post_embeddings=jnp.ones((1, 128, 2, 64)),     # [B, SeqLen, num_item_hashes, D]
    candidate_post_embeddings=jnp.ones((1, 32, 2, 64)),    # [B, CandLen, num_item_hashes, D]
    history_author_embeddings=jnp.ones((1, 128, 2, 64)),   # [B, SeqLen, num_author_hashes, D]
    candidate_author_embeddings=jnp.ones((1, 32, 2, 64)),  # [B, CandLen, num_author_hashes, D]
)
```
Step 3: Implement the Reduction Logic
Phoenix provides built-in utilities to "flatten" these multiple hash embeddings into a single vector that the transformer can process. You will typically use block_user_reduce and block_history_reduce inside your model's forward pass.
Reducing User Embeddings
The `block_user_reduce` function concatenates the per-hash embeddings and projects them back to the model's standard embedding size.
```python
from phoenix.recsys_model import block_user_reduce

# user_hashes:     [B, num_user_hashes]
# user_embeddings: [B, num_user_hashes, D]
user_repr, padding_mask = block_user_reduce(
    user_hashes=batch.user_hashes,
    user_embeddings=embeddings.user_embeddings,
    num_user_hashes=2,
    emb_size=64,
)
```
Reducing History Sequences
For user history (the sequence of posts a user has interacted with), block_history_reduce combines post, author, action, and surface embeddings into a single sequence representation.
```python
from phoenix.recsys_model import block_history_reduce

history_sequence, history_mask = block_history_reduce(
    history_post_hashes=batch.history_post_hashes,
    history_post_embeddings=embeddings.history_post_embeddings,
    history_author_embeddings=embeddings.history_author_embeddings,
    history_product_surface_embeddings=surface_embs,
    history_actions_embeddings=action_embs,
    num_item_hashes=2,
    num_author_hashes=2,
)
```
Step 4: Integrating with Candidate Towers
For retrieval tasks (Phoenix Retrieval), candidate posts and authors must be projected into a shared normalized space using the CandidateTower. This allows for efficient dot-product similarity searches.
```python
from phoenix.recsys_retrieval_model import CandidateTower

# Initialize the tower
candidate_tower = CandidateTower(emb_size=64)

# Project concatenated post and author embeddings
# Shape: [Batch, NumCandidates, (NumPostHashes + NumAuthorHashes) * D]
candidate_repr = candidate_tower(concatenated_embeddings)
```
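Once user and candidate representations live in the same normalized space, retrieval scoring reduces to a dot product. A minimal NumPy sketch with stand-in arrays (the shapes mirror the example above; the names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
user_repr = rng.normal(size=(1, 64))           # [B, D] from the user side
candidate_repr = rng.normal(size=(1, 32, 64))  # [B, NumCandidates, D] from the tower

# Dot-product similarity between each user and each of its candidates
scores = np.einsum('bd,bcd->bc', user_repr, candidate_repr)  # [B, NumCandidates]

# Indices of the five highest-scoring candidates per user
top_k = np.argsort(-scores, axis=-1)[:, :5]
```

In production this exhaustive scoring is replaced by an ANN index over the candidate vectors, which is why the shared normalized space matters.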
Best Practices
- Initialization Scale: When using `block_user_reduce`, you can adjust `embed_init_scale`. Higher scales can speed up initial convergence but may lead to instability if set too high.
- Normalization: The `CandidateTower` automatically applies L2 normalization to its output. If you implement a custom retrieval tower, normalize your embeddings to maintain compatibility with the ANN (Approximate Nearest Neighbor) indices used in the "For You" pipeline.
- Action Embeddings: Always include the user's action (like, reply, etc.) when building history embeddings. The Grok transformer relies heavily on the relationship between the post content and the user's specific engagement type.
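For a custom retrieval tower, the L2 normalization mentioned above can be sketched as follows (a minimal NumPy version; the epsilon guard is an assumption, added to avoid division by zero on padded rows):

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-6) -> np.ndarray:
    # Scale vectors to unit length so dot products become cosine
    # similarities, which is what an ANN index over unit vectors expects.
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

candidates = np.random.default_rng(0).normal(size=(4, 64))
unit_candidates = l2_normalize(candidates)  # each row now has norm 1
```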