how it works

Short description

Twitter has posted the source code of its recommendation algorithm on GitHub, detailing how it extracts the best tweets to fill users’ personalised “For You” feeds. Twitter receives more than 500 million tweets each day, and the recommendation algorithm is crucial to surfacing the most important topics within each user’s sphere of interest. Three stages are involved: selecting candidates, ranking them with a machine learning model, and applying heuristics and filters. Twitter’s Home Mixer service is then responsible for building and serving the feed. The aim of making the code public is to increase transparency, with planned updates including analytics and security features.

Almost a year ago, Elon Musk proposed making Twitter’s recommendation algorithm public. The company recently posted the source code of its algorithm on GitHub.

This article is a translation of their blog post describing the recommendation algorithm. It will be useful for:

  • anyone who wants to know how algorithms choose what to show you in the feed,

  • Data Scientists and ML engineers as a unique source of insights into the operation of a large recommender system.


Twitter aims to show you the most relevant of what’s happening in the world right now. This requires a recommendation algorithm that can pick, out of the 500 million tweets posted every day, the best ones to show in the “For You” section. In this article, we will explain how the algorithm selects tweets for your feed.

How do we choose tweets?

The basis of Twitter’s recommendations is a set of models and features that extract latent information from tweets, users and interaction data. These models seek to answer important questions such as “How likely is it that you will interact with this user in the future?” or “Which communities stand out on Twitter and what tweets are trending in them?” Accurate answers to these questions allow Twitter to make more relevant recommendations.

The recommendation system consists of three main stages:

  1. Selection of candidates – Extracting the best tweets from different recommendation sources.

  2. Ranking of these tweets using a machine learning model.

  3. Applying heuristics and filters, for example, filtering out tweets from users you’ve blocked, NSFW content, and tweets you’ve already seen.

The service responsible for creating and providing the For You feed is called Home Mixer. Home Mixer is built on top of Product Mixer, our custom Scala platform that makes it easy to create a content feed. This service connects various candidate sources, scoring functions, heuristics and filters.
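The three stages can be sketched as a small pipeline. This is a minimal Python illustration under stated assumptions, not Home Mixer's actual Scala code: the names `Tweet`, `build_feed`, and the callable types are hypothetical, and the scorer and filters are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    text: str


# Hypothetical type aliases; none of these names come from Twitter's code.
CandidateSource = Callable[[str], Iterable[Tweet]]  # user -> candidate tweets
Scorer = Callable[[str, Tweet], float]              # (user, tweet) -> relevance
TweetFilter = Callable[[str, Tweet], bool]          # True means "keep"


def build_feed(
    user: str,
    sources: List[CandidateSource],
    scorer: Scorer,
    filters: List[TweetFilter],
    limit: int = 10,
) -> List[Tweet]:
    # Stage 1: gather candidates from every source, de-duplicated by tweet id.
    candidates: Dict[int, Tweet] = {}
    for source in sources:
        for tweet in source(user):
            candidates.setdefault(tweet.tweet_id, tweet)

    # Stage 2: rank all candidates by score, highest first,
    # regardless of which source produced them.
    ranked = sorted(candidates.values(), key=lambda t: scorer(user, t), reverse=True)

    # Stage 3: keep only tweets that pass every filter.
    kept = [t for t in ranked if all(f(user, t) for f in filters)]
    return kept[:limit]
```

Keeping sources, scorer, and filters as interchangeable functions mirrors the idea of a service that only connects the parts; how Product Mixer actually composes them is not shown here.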

The diagram below illustrates the main components used to build a feed:

Let’s consider the key elements of this system in approximately the order in which they are invoked during a single request to display a feed. Let’s start with retrieving candidates from the candidate sources.

Sources of candidates

Twitter has several candidate sources for fresh and relevant tweets. Through these sources, we try to get the best 1,500 tweets out of hundreds of millions for each request. We find candidates from users you follow (In-Network) and from users you don’t follow (Out-of-Network). The For You feed is on average 50% In-Network tweets and 50% Out-of-Network tweets, although this ratio can vary from user to user.

In-Network source

In-Network is the largest source of candidates. It provides tweets from users you follow. Using a logistic regression model, these tweets are sorted by their relevance. The best tweets are then sent to the next stage.

The most important component in In-Network tweet ranking is the Real Graph. Real Graph is a model that predicts the likelihood of interaction between two users. The higher the Real Graph score between you and the author of a tweet, the more of their tweets we’ll include.
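To make the idea of a logistic regression score between two users concrete, here is a minimal sketch. The feature names, weights, and bias are invented for illustration; the article does not say which features or coefficients Real Graph uses.

```python
import math
from typing import Dict, List

# Hypothetical weights and bias; NOT taken from Twitter's model.
WEIGHTS: Dict[str, float] = {"likes": 0.08, "replies": 0.15, "profile_visits": 0.05}
BIAS = -3.0


def interaction_probability(features: Dict[str, float]) -> float:
    """Logistic regression: sigmoid of a weighted sum of interaction counts."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def rank_authors(features_by_author: Dict[str, Dict[str, float]]) -> List[str]:
    """Order followed authors by predicted interaction probability, highest first."""
    return sorted(
        features_by_author,
        key=lambda author: interaction_probability(features_by_author[author]),
        reverse=True,
    )
```

An author with many past likes and replies gets a probability near 1, while an author with no recorded interactions falls back to the value set by the bias alone.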

The In-Network source was recently redesigned. We stopped using the Fanout Service, a 12-year-old service that provided tweets from a per-user cache. We are also reworking the logistic regression ranking model, which was last updated and trained a few years ago!

Out-of-Network sources

Finding relevant tweets outside the user’s network is a more difficult problem: How can we tell if certain tweets will be relevant to you if you’re not following the author? Twitter uses two approaches to solve this problem.

1 Social Graph

The first approach analyzes the likes of people you follow or those who have similar interests to you.

We traverse the graph of interactions and follows to answer the following questions:

  • What tweets have the people I follow recently liked?

  • Who likes the same tweets as me, and what else have they liked recently?

We generate candidates based on the answers to these questions and rank the resulting tweets using a logistic regression model. Such graph traversals are critical to our recommendations. To perform them, we developed GraphJet, a graph processing engine that maintains a real-time graph of interactions between users and tweets. While this approach has proven to be useful (accounting for about 15% of the tweets in the home feed), approaches based on embedding spaces have contributed more.
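The two questions above can be answered on a toy in-memory graph as follows. This is only a sketch of the traversal idea, not GraphJet: the function names are hypothetical, the graph is a pair of plain dictionaries, and counting likes is an assumed stand-in for the real candidate scoring.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def liked_by_followees(
    user: str, follows: Dict[str, Set[str]], likes: Dict[str, Set[int]]
) -> List[Tuple[int, int]]:
    """Tweets liked by people `user` follows, with the number of such likes."""
    already_liked = likes.get(user, set())
    counts: Counter = Counter()
    for followee in follows.get(user, set()):
        for tweet_id in likes.get(followee, set()):
            if tweet_id not in already_liked:
                counts[tweet_id] += 1
    # Most-liked first; ties broken by tweet id for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def liked_by_similar_users(user: str, likes: Dict[str, Set[int]]) -> List[Tuple[int, int]]:
    """Tweets liked by users who share likes with `user`, weighted by the overlap."""
    mine = likes.get(user, set())
    counts: Counter = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue
        for tweet_id in theirs - mine:
            counts[tweet_id] += overlap
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Both functions return candidates only; in the pipeline described here, those candidates would still go through ranking afterwards.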

2 Embedding Spaces

Embedding-based approaches aim to answer a more general question about content similarity: Which tweets and users are similar to my interests?

Embeddings are numerical representations of user interests and tweet content. With them, we can calculate the similarity between any two users, tweets, or user-tweet pairs in the embedding space. This similarity can be used as a proxy for relevance, provided the embeddings are sufficiently accurate.
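As an illustration of similarity in an embedding space, the sketch below ranks tweets by cosine similarity to a user vector. Cosine similarity is an assumption chosen for the example; the article does not name the similarity measure, and the vectors and function names are made up.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    user_vec: Sequence[float], tweet_vecs: Dict[str, Sequence[float]]
) -> List[str]:
    """Tweet ids ordered from most to least similar to the user embedding."""
    return sorted(
        tweet_vecs,
        key=lambda tweet_id: cosine_similarity(user_vec, tweet_vecs[tweet_id]),
        reverse=True,
    )
```

Because the same function works for any pair of vectors, it covers user-user, tweet-tweet, and user-tweet comparisons alike.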

One of the most useful embedding spaces on Twitter is SimClusters. SimClusters discovers communities anchored around influential users (influencers) using a custom matrix factorization algorithm. There are 145,000 communities, which are updated every three weeks. Users and tweets can belong to multiple communities. Communities range in size from a few thousand users for individual groups of friends to hundreds of millions of users for news or pop culture. Here are some of the largest communities:

We include a tweet in a community based on its current popularity in that community. The more users in a community like it, the more that tweet will be associated with that community.
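One way to picture this association is to add up the community memberships of everyone who liked a tweet. The sketch below assumes each user has a weight per community and that a tweet's affinity is the plain sum over its likers; the actual SimClusters computation is not described in this article, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def tweet_community_scores(
    likers: Iterable[str], user_communities: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Sum the community weights of every user who liked the tweet."""
    totals: Dict[str, float] = defaultdict(float)
    for user in likers:
        for community, weight in user_communities.get(user, {}).items():
            totals[community] += weight
    return dict(totals)
```

Each additional like from a member of a community raises the tweet's score for that community, which matches the behaviour described above.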

Ranking

At this point we have ~1500 potentially relevant candidates. The next step is to score each candidate on how relevant it is to your feed. Here, all candidates are treated equally, regardless of source.

The ranking is performed by a neural network with ~48 million parameters that is continuously trained on interactions with tweets. It optimizes for positive engagement (e.g. likes, retweets and replies). This ranking model considers thousands of features and outputs ten labels, each of which represents the probability of a particular kind of interaction. These labels are combined into a composite score for each tweet, and we rank tweets based on these scores.
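A composite score can be sketched as a weighted sum of the per-label probabilities. Both the weighted-sum form and the weights below are assumptions for illustration, and only three of the ten labels are shown; the article states only that the labels yield one score per tweet.

```python
from typing import Dict, List

# Hypothetical label weights; the real labels and weights are not given here.
ENGAGEMENT_WEIGHTS: Dict[str, float] = {"like": 1.0, "retweet": 2.0, "reply": 3.0}


def composite_score(
    probabilities: Dict[str, float], weights: Dict[str, float] = ENGAGEMENT_WEIGHTS
) -> float:
    """Weighted sum of predicted interaction probabilities for one tweet."""
    return sum(weights[label] * probabilities.get(label, 0.0) for label in weights)


def rank_by_score(candidates: Dict[int, Dict[str, float]]) -> List[int]:
    """Tweet ids ordered by composite score, highest first."""
    return sorted(candidates, key=lambda tid: composite_score(candidates[tid]), reverse=True)
```

With these example weights, a tweet likely to draw replies can outrank one that is merely likely to be liked.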

Heuristics, filters and additional functions

The next step is to apply heuristics and filters to improve product quality. These features work together to create a balanced and diverse feed. Here are some examples:

  • Visibility filtering: Filter tweets based on their content and your preferences. For example, remove tweets from accounts you’ve blocked.

  • Diversity of authors: avoiding long sequences of tweets from the same author

  • Content balance: ensuring a balance of In-Network and Out-of-Network tweets.

  • Accounting for negative feedback: lowering the score of tweets similar to ones you gave negative feedback on.

  • Social proof: excluding Out-of-Network tweets that have no second-degree connection to you. That is, it is guaranteed that someone you follow interacted with the tweet or follows its author.

  • Conversations: adding the original tweet to a reply to give it context.

  • Edited tweets: Identify tweets that are currently out of date on the device and replace them with edited versions.
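Two of the heuristics above, visibility filtering and author diversity, can be sketched as follows. The decay-based approach to author diversity is an assumption chosen for the example (each further tweet from the same author has its score multiplied by a constant factor), and the function names and tuple layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

ScoredTweet = Tuple[int, str, float]  # (tweet_id, author, score)


def remove_blocked(ranked: List[ScoredTweet], blocked: Set[str]) -> List[ScoredTweet]:
    """Visibility filter: drop tweets whose author the viewer has blocked."""
    return [tweet for tweet in ranked if tweet[1] not in blocked]


def apply_author_diversity(ranked: List[ScoredTweet], decay: float = 0.5) -> List[ScoredTweet]:
    """Down-weight repeated authors, then re-sort by the adjusted score.

    `ranked` must already be sorted by score, highest first, so that an
    author's best tweet keeps its full score.
    """
    seen: Dict[str, int] = {}
    rescored: List[ScoredTweet] = []
    for tweet_id, author, score in ranked:
        repeats = seen.get(author, 0)
        rescored.append((tweet_id, author, score * decay ** repeats))
        seen[author] = repeats + 1
    return sorted(rescored, key=lambda tweet: tweet[2], reverse=True)
```

Decaying scores rather than dropping tweets keeps a prolific author in the feed while making long runs from the same account unlikely.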

Data enrichment and transfer

At this point, Home Mixer has a set of tweets ready to be sent to the device. Tweets are mixed with other content, such as advertising, follow recommendations, and tips, and the result is returned to the device for display.

The above pipeline runs approximately 5 billion times per day and completes in about 1.5 seconds. At the same time, one pipeline run requires 220 seconds of CPU time, almost 150 times more than the latency you see in the application.
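The figures above can be checked with a few lines of arithmetic. The variable names are ours, and the derived per-second and per-day totals are simple consequences of the quoted numbers rather than figures stated by Twitter.

```python
SECONDS_PER_DAY = 24 * 60 * 60

runs_per_day = 5_000_000_000   # pipeline executions per day (from the text)
latency_seconds = 1.5          # wall-clock time of one run (from the text)
cpu_seconds_per_run = 220      # CPU time of one run (from the text)

# Average number of pipeline runs started every second: about 57,870.
runs_per_second = runs_per_day / SECONDS_PER_DAY

# CPU time relative to perceived latency: about 146.7, i.e. "almost 150 times".
cpu_to_latency_ratio = cpu_seconds_per_run / latency_seconds

# Total CPU time consumed per day: 1.1 trillion CPU-seconds.
cpu_seconds_per_day = runs_per_day * cpu_seconds_per_run
```

The ratio of roughly 147 indicates that a single request is spread across many cores working in parallel.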

The main goal of the open source project is to provide complete transparency to you, our users, about how our systems work. We’ve made the recommendation code available on GitHub for a more in-depth look at our algorithm. We’re also working on providing more transparency around other features within our app. Some of the planned new developments include:

  • A better analytics platform for content creators with more insight into reach and engagement.

  • Greater transparency about any security labels applied to your tweets or accounts.

  • Greater visibility into why tweets appear in your feed.


