Candidate Generation

Candidate generation is the critical first stage of X’s recommendation pipeline, responsible for narrowing down approximately 1 billion potential tweets to a manageable set of thousands of candidates for downstream ranking. This process leverages diverse candidate sources and user behavior signals.

Overview

The candidate sourcing stage uses X user behavior as the primary input to identify potentially relevant content. Multiple specialized systems work in parallel to retrieve candidates from different perspectives.

  • Input: ~1 billion tweets
  • Process: multi-source retrieval
  • Output: ~2-5K candidates

Candidate Sources

Home Mixer orchestrates multiple candidate sources to retrieve diverse content:

For You Timeline Sources

Earlybird Search Index: Find and rank tweets from accounts the user follows
  • Coverage: ~50% of For You timeline candidates
  • Method: Search index traversal with light ranker scoring
  • Pipeline: ScoredTweetsInNetworkCandidatePipelineConfig
// In-network candidate retrieval
val inNetworkCandidates = earlybirdClient.search(
  userId = request.userId,
  followedUserIds = socialGraph.getFollowing(request.userId),
  maxResults = 2000,
  rankingMode = LightRanker
)

Earlybird combines candidate retrieval with light ranking for efficient in-network scoring.
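The retrieval call above can be read as "filter to followed authors, then keep the best-scoring tweets". The Python sketch below illustrates that shape only. It assumes an in-memory list stands in for the search index, and `light_rank` is a made-up scoring function (likes discounted by age), not the actual light ranker, whose features are not described here.

```python
import heapq

def light_rank(tweet, now_ts):
    """Hypothetical light-ranker score: likes discounted by age in hours."""
    age_hours = max(0.0, (now_ts - tweet["created_at"]) / 3600.0)
    return (1.0 + tweet["likes"]) / (1.0 + age_hours)

def in_network_candidates(index, followed_ids, now_ts, max_results):
    """Keep tweets from followed authors and return the top-scoring ones."""
    in_network = (t for t in index if t["author_id"] in followed_ids)
    return heapq.nlargest(max_results, in_network,
                          key=lambda t: light_rank(t, now_ts))
```

Scoring during retrieval means only `max_results` tweets leave this stage, so the more expensive downstream ranker never sees the full in-network set.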

User Signals for Candidate Sourcing

Candidate sources use diverse user behavior signals to identify relevant content:

Explicit Signals

Social Graph

  • Author Follow: Accounts the user follows
  • Author Unfollow: Recently unfollowed accounts
  • Author Mute: Muted accounts
  • Author Block: Blocked accounts

Tweet Engagement

  • Tweet Favorite: Liked tweets
  • Tweet Unfavorite: Unliked tweets
  • Retweet: Retweeted content
  • Quote Tweet: Retweets with comments
  • Tweet Reply: Replied to tweets
  • Tweet Share: Shared tweets
  • Tweet Bookmark: Bookmarked content

Negative Signals

  • Tweet Don’t Like: “Not interested” feedback
  • Tweet Report: Reported tweets

Implicit Signals

  • Tweet Click: Viewed tweet details
  • Tweet Video Watch: Video watch time
  • Notification Open: Opened push notifications
  • Ntab Click: Clicks from notifications tab

Signal Usage by Component

Different candidate sources use signals as features and/or training labels:
USS = User Signal Service, UTEG = User Tweet Entity Graph, FRS = Follow Recommendation Service
Signal            | USS      | SimClusters     | TwHIN           | UTEG     | FRS             | Light Ranking
Author Follow     | Features | Features/Labels | Features/Labels | Features | Features/Labels | N/A
Tweet Favorite    | Features | Features        | Features/Labels | Features | Features/Labels | Features/Labels
Retweet           | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels
Quote Tweet       | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels
Tweet Reply       | Features | N/A             | Features        | Features | Features/Labels | Features
Tweet Click       | Features | N/A             | N/A             | N/A      | Features        | Labels
Video Watch       | Features | Features        | N/A             | N/A      | N/A             | Labels
Notification Open | Features | Features        | Features        | N/A      | Features        | N/A
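The usage table can also be treated as a small lookup structure, which makes questions such as "which components train on this signal?" easy to answer. The sketch below transcribes the table as-is. The encoding ("F" for features, "L" for labels, empty string for N/A) and the helper names are illustrative choices, not part of any X service.

```python
# Signal usage matrix transcribed from the table above.
# "F" = used as features, "L" = used as training labels, "" = N/A.
COMPONENTS = ["USS", "SimClusters", "TwHIN", "UTEG", "FRS", "Light Ranking"]

SIGNAL_USAGE = {
    "Author Follow":     ["F", "FL", "FL", "F", "FL", ""],
    "Tweet Favorite":    ["F", "F", "FL", "F", "FL", "FL"],
    "Retweet":           ["F", "", "FL", "F", "FL", "FL"],
    "Quote Tweet":       ["F", "", "FL", "F", "FL", "FL"],
    "Tweet Reply":       ["F", "", "F", "F", "FL", "F"],
    "Tweet Click":       ["F", "", "", "", "F", "L"],
    "Video Watch":       ["F", "F", "", "", "", "L"],
    "Notification Open": ["F", "F", "F", "", "F", ""],
}

def components_using(signal: str, role: str) -> list[str]:
    """Components that use `signal` in the given role ("F" or "L")."""
    usage = SIGNAL_USAGE[signal]
    return [c for c, u in zip(COMPONENTS, usage) if role in u]

def signals_used_as_labels(component: str) -> list[str]:
    """Signals that `component` uses as training labels."""
    idx = COMPONENTS.index(component)
    return [s for s, usage in SIGNAL_USAGE.items() if "L" in usage[idx]]
```

For example, `components_using("Tweet Click", "L")` returns only Light Ranking, matching the table row where clicks serve as labels for the light ranker and as features elsewhere.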

Candidate Source Algorithms

SimClusters

Community detection and sparse embeddings

1. Community Detection

Identify communities of users with similar interests:
// SimClusters community detection
val communities = detectCommunities(
  userFollowGraph,
  numCommunities = 145000
)

2. User Embeddings

Represent users as sparse vectors over communities:
// User representation
val userEmbedding = Map(
  communityId_1234 -> 0.8,  // Strong affinity
  communityId_5678 -> 0.6,  // Medium affinity
  communityId_9012 -> 0.3   // Weak affinity
)

3. Tweet Embeddings

Represent tweets based on engagement from community members:
// Tweet representation  
val tweetEmbedding = Map(
  communityId_1234 -> 0.7,  // Engaged by community 1234
  communityId_5678 -> 0.4   // Engaged by community 5678
)

4. Candidate Retrieval

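Find tweets from the user’s communities:
// Retrieve candidates via community overlap
val candidates = userEmbedding.keys.toSeq.flatMap { communityId =>
  getTweetsFromCommunity(communityId)
}.distinct                              // a tweet can surface from several communities
  .sortBy(tweet => -tweetScore(tweet))  // highest score first
  .take(500)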
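One common way to compare two sparse embeddings like the ones above is a dot product over the communities they share. The Python sketch below assumes that scoring rule; the source does not state which similarity SimClusters uses in production. It reuses the example affinities from the user and tweet embeddings shown earlier, and `top_candidates` is a hypothetical helper.

```python
SparseEmbedding = dict[int, float]  # community id -> affinity

def dot(a: SparseEmbedding, b: SparseEmbedding) -> float:
    """Dot product over shared communities; iterate the smaller map."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[c] for c, w in a.items() if c in b)

def top_candidates(user_emb, tweet_embs, k):
    """Rank tweets by overlap with the user; drop tweets with no overlap."""
    scored = [(tid, dot(user_emb, emb)) for tid, emb in tweet_embs.items()]
    scored = [(tid, s) for tid, s in scored if s > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [tid for tid, _ in scored[:k]]

# Example values from the embeddings above.
user = {1234: 0.8, 5678: 0.6, 9012: 0.3}
tweet = {1234: 0.7, 5678: 0.4}
score = dot(user, tweet)  # 0.8*0.7 + 0.6*0.4, approximately 0.80
```

Because both vectors are sparse, the cost of each comparison depends on the number of shared communities rather than on the total community count.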

TwHIN

Dense knowledge graph embeddings for Users and Tweets
Build heterogeneous graph with multiple entity types:
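# TwHIN graph structure (schematic; node collections are left empty here)
user_nodes, tweet_nodes, topic_nodes = [], [], []

graph = {
    'users': user_nodes,
    'tweets': tweet_nodes,
    'topics': topic_nodes,
    # Edge types as (source type, relation, target type)
    'edges': [
        ('user', 'follows', 'user'),
        ('user', 'likes', 'tweet'),
        ('user', 'interested_in', 'topic'),
        ('tweet', 'about', 'topic'),
    ]
}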
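Once users and tweets live in the same dense vector space, candidate retrieval reduces to a nearest-neighbor lookup. The sketch below assumes the embeddings already exist and that similarity is a plain dot product. How TwHIN trains its vectors is not covered here, the three-dimensional vectors are made up, and a production system would use an approximate nearest-neighbor index rather than a full scan.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))

def nearest_tweets(user_vec, tweet_vecs, k):
    """Return the k tweet ids whose vectors score highest against the user."""
    return heapq.nlargest(k, tweet_vecs,
                          key=lambda tid: dot(user_vec, tweet_vecs[tid]))

# Made-up example vectors, for illustration only.
user_vec = [0.2, -0.1, 0.9]
tweet_vecs = {
    "t1": [0.1, 0.0, 0.8],
    "t2": [-0.5, 0.3, 0.1],
    "t3": [0.3, -0.2, 0.7],
}
```

Unlike the sparse community vectors used by SimClusters, every dimension participates in each comparison here, which is why dense retrieval usually depends on a dedicated vector index.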

Real Graph

Predict likelihood of user-to-user interaction
// Real Graph scoring
val realGraphScore = realGraphModel.predict(
  sourceUser = userId,
  destUser = authorId,
  features = Seq(
    mutualFollowCount,
    recentInteractionCount,
    followDuration,
    commonInterests
  )
)

// Use for candidate weighting
val weightedScore = candidateScore * realGraphScore
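To make the weighting step concrete, the sketch below stands in for `realGraphModel.predict` with a small logistic model. This is an assumption for illustration: the source does not say what model family Real Graph uses, and the weights, bias, and feature names here are invented.

```python
import math

# Hypothetical weights, for illustration only; a real model would learn these.
WEIGHTS = {
    "mutual_follow_count": 0.05,
    "recent_interaction_count": 0.30,
    "follow_duration_days": 0.001,
    "common_interests": 0.10,
}
BIAS = -2.0

def interaction_probability(features):
    """Logistic score in (0, 1): estimated chance the user interacts with the author."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def weighted_score(candidate_score, features):
    """Scale a candidate's score by the predicted interaction probability."""
    return candidate_score * interaction_probability(features)
```

Because the probability lies strictly between 0 and 1, multiplying by it can only shrink a candidate's score, so authors the user rarely interacts with are demoted rather than removed.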

Candidate Pipeline Flow

1. Parallel Retrieval

Query all candidate sources simultaneously:
val candidateFutures = Future.collect(Seq(
  earlybirdSource.get(request),
  utegSource.get(request),
  tweetMixerSource.get(request),
  frsSource.get(request)
))

2. Candidate Merging

Combine candidates from all sources:
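val allCandidates = candidateFutures.map { sources =>
  // Flatten only; duplicates are resolved in the deduplication step,
  // which keeps the highest-scoring copy of each tweet
  sources.flatten
}

3. Basic Filtering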

Apply lightweight filters in candidate pipeline:
// allCandidates is a Future, so filter the sequence inside it
val filtered = allCandidates.map { candidates =>
  candidates.filter { candidate =>
    !isBlocked(candidate.authorId) &&
    !isMuted(candidate.authorId) &&
    !hasSeenRecently(candidate.tweetId) &&
    meetsQualityThreshold(candidate)
  }
}

4. Deduplication

Remove duplicate candidates:
val deduplicated = filtered.map { candidates =>
  candidates
    .groupBy(_.tweetId)
    .values
    // Keep the highest-scoring version of each tweet
    .map(duplicates => duplicates.maxBy(_.score))
    .toSeq
}

5. Pass to Ranking

Send candidates to feature hydration and scoring:
val rankedCandidates = deduplicated.map { candidates =>
  scoringPipeline(
    candidates = candidates,
    maxToRank = 2000
  )
}
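The five steps can be sketched end to end in a few lines. The Python version below is an illustration under stated assumptions: `fetch` is a stand-in for a remote candidate source, the `Candidate` fields are hypothetical, filtering covers only blocked authors and recently seen tweets, and the final hand-off is approximated by sorting on score and truncating to `max_to_rank`.

```python
import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    tweet_id: int
    author_id: int
    score: float
    source: str

async def fetch(candidates):
    """Stand-in for a remote candidate source call."""
    await asyncio.sleep(0)
    return candidates

async def generate_candidates(sources, blocked_authors, seen_tweets, max_to_rank):
    # 1. Parallel retrieval: query every source concurrently.
    results = await asyncio.gather(*(fetch(c) for c in sources.values()))
    # 2. Merging: flatten all batches, keeping duplicates for now.
    merged = [c for batch in results for c in batch]
    # 3. Basic filtering: drop blocked authors and recently seen tweets.
    filtered = [c for c in merged
                if c.author_id not in blocked_authors
                and c.tweet_id not in seen_tweets]
    # 4. Deduplication: keep the highest-scoring copy of each tweet.
    best = {}
    for c in filtered:
        if c.tweet_id not in best or c.score > best[c.tweet_id].score:
            best[c.tweet_id] = c
    # 5. Hand-off: pass at most max_to_rank candidates, best first.
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked[:max_to_rank]
```

Calling `asyncio.run(generate_candidates(sources, blocked, seen, 2000))` with a dict of source name to candidate list runs the whole flow. Deduplicating after the merge, rather than during it, is what lets the highest-scoring copy win when two sources return the same tweet.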

GraphJet Framework

Many candidate sources (UTEG, Recos-Injector) use the GraphJet framework:

GraphJet

In-memory graph processing for real-time recommendations.

Key Features:
  • Real-time graph updates from user actions
  • Sub-millisecond graph traversal queries
  • Bipartite graph representation (users ↔ tweets)
  • Time-decayed edge weights for recency
// GraphJet-style graph structure (simplified illustration).
// UserId, TweetId, and Timestamp are assumed to be aliases for Long.
class UserTweetGraph(
  // Bipartite graph: users -> tweets
  userToTweets: Map[UserId, Seq[(TweetId, Timestamp)]],
  // Reverse index: tweets -> users
  tweetToUsers: Map[TweetId, Seq[(UserId, Timestamp)]]
) {
  def recommend(userId: UserId): Seq[TweetId] = {
    // 1. Get tweets the user engaged with (IDs only; timestamps are ignored here)
    val seedTweets = userToTweets.getOrElse(userId, Seq.empty).map(_._1).toSet

    // 2. Find other users who engaged with the same tweets
    val similarUsers = seedTweets.toSeq.flatMap { tweetId =>
      tweetToUsers.getOrElse(tweetId, Seq.empty).map(_._1)
    }.distinct.filterNot(_ == userId)

    // 3. Get tweets those users engaged with, excluding the seed tweets
    val candidates = similarUsers.flatMap { user =>
      userToTweets.getOrElse(user, Seq.empty).map(_._1)
    }.filterNot(seedTweets.contains)

    // 4. Score and rank
    candidates.groupBy(identity).map { case (tweetId, occurrences) =>
      (tweetId, occurrences.size)  // Score by frequency
    }.toSeq.sortBy(-_._2).map(_._1)
  }
}
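The traversal above scores candidates by raw frequency and ignores timestamps. The time-decayed edge weights listed under Key Features could be folded in by weighting each engagement by its age. The sketch below assumes exponential decay with a configurable half-life; the decay function GraphJet actually uses is not specified here, and the one-hour default is arbitrary.

```python
from collections import defaultdict

def decayed_weight(event_ts, now_ts, half_life_s):
    """Weight that halves every `half_life_s` seconds of age."""
    age = max(0.0, now_ts - event_ts)
    return 0.5 ** (age / half_life_s)

def score_candidates(engagements, now_ts, half_life_s=3600.0):
    """Rank tweet ids by the sum of decayed engagement weights.

    `engagements` is an iterable of (tweet_id, timestamp) pairs.
    """
    scores = defaultdict(float)
    for tweet_id, ts in engagements:
        scores[tweet_id] += decayed_weight(ts, now_ts, half_life_s)
    return sorted(scores, key=lambda t: scores[t], reverse=True)
```

With a one-hour half-life, a single engagement from the current moment (weight 1.0) outranks two engagements from two hours ago (0.25 each), which a pure frequency count would rank the other way around.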

Performance Characteristics

Reduction Ratio

Roughly 200,000:1 to 500,000:1 reduction (~1 billion tweets down to ~2-5K candidates)

Latency

50-200ms total for parallel candidate retrieval

Diversity

Multiple sources ensure diverse content perspectives

Freshness

Real-time graph updates capture latest user behavior

Candidate Quality Signals

Early quality filtering in candidate generation:
// Quality gates for candidates
val qualityCriteria = Seq(
  // Author reputation
  authorTweepCredScore > minReputationThreshold,
  
  // Early engagement
  earlyLikeCount > minEarlyEngagement,
  
  // Content safety
  !isFlaggedByTrustAndSafety,
  
  // Spam detection  
  !isLikelySpam,
  
  // Language match
  tweetLanguage.isCompatibleWith(userLanguages)
)
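A list of booleans says whether a candidate passes, but not why it failed. One way to keep that information is to name each gate and evaluate them in order, as in the sketch below. The candidate field names and thresholds are hypothetical, and the language check is simplified to set membership.

```python
from typing import Callable

Gate = tuple[str, Callable[[dict], bool]]

def make_gates(min_reputation, min_early_likes, user_languages) -> list[Gate]:
    """Build named quality gates mirroring the criteria above."""
    return [
        ("author_reputation", lambda c: c["author_reputation"] > min_reputation),
        ("early_engagement", lambda c: c["early_like_count"] > min_early_likes),
        ("content_safety", lambda c: not c["flagged_by_trust_and_safety"]),
        ("spam", lambda c: not c["likely_spam"]),
        ("language", lambda c: c["language"] in user_languages),
    ]

def first_failed_gate(candidate, gates):
    """Return the name of the first failing gate, or None if all pass."""
    for name, check in gates:
        if not check(candidate):
            return name
    return None
```

Evaluating the gates in order also lets cheap checks run before expensive ones, and counting failures per gate name gives a quick view of which filter removes the most candidates.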
