Limits to Predicting Online Speech Using Large Language Models
Mina Remeli, Moritz Hardt, Robert C. Williamson
TL;DR
The paper investigates how well state-of-the-art transformer-based language models can predict individual online speech on Twitter, using negative log-likelihood and bits per character as uncertainty metrics. The authors collect millions of tweets and evaluate four model families under four context settings: no context, random context, peer context, and user context. User context provides the strongest predictive signal, and the predictive gain from peer context is consistently smaller. Predictions nonetheless remain far from language-entropy limits: even with context, only the largest models approach the English entropy benchmark of about $1.12$ bits per character. Part of the improvement from context (up to 20\%) comes from models learning common Twitter syntax such as hashtags and @-mentions. Tweet-tuning reduces reliance on in-context information, and the resulting rankings are robust across models and demographic proxies, though predictive difficulty varies by group (e.g., Nigeria). The work informs debates on privacy risks in platform data and suggests limits to how much individual online speech current LLMs can predict, guiding future research into the intersection of model uncertainty and real-world impact.
Abstract
Our paper studies the predictability of online speech -- that is, how well language models learn to model the distribution of user-generated content on X (previously Twitter). We define predictability as a measure of the model's uncertainty, i.e., its negative log-likelihood. As the basis of our study, we collect 10M tweets for ``tweet-tuning'' base models and a further 6.25M posts from more than five thousand X users and their peers. We find that predicting the posts of individual users remains surprisingly hard. Moreover, it matters greatly what context is used: models using a user's own history significantly outperform models using posts from their social circle. We validate these results across four large language models ranging in size from 1.5 billion to 70 billion parameters. Our results also replicate if, instead of prompting the model with additional context, we finetune on it. We follow up with a detailed investigation of what is learned in-context and a demographic analysis. Up to 20\% of what is learned in-context is the use of @-mentions and hashtags. Our main results hold across the demographic groups we studied.
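
To make the uncertainty measure concrete, the following minimal sketch shows one way a negative log-likelihood could be normalized to bits per character, and how a predictive gain from context could be computed as a difference between two such values. It is an illustration under stated assumptions, not the paper's implementation: the function names `bits_per_character` and `context_gain` are hypothetical, per-token log-probabilities are assumed to be natural logarithms, and the text length is assumed to be counted in Unicode code points.

```python
import math
from typing import Sequence


def bits_per_character(token_logprobs: Sequence[float], text: str) -> float:
    """Normalize a model's negative log-likelihood of `text` to bits per character.

    Assumptions (not taken from the paper): `token_logprobs` holds the
    natural-log probability the model assigned to each token of `text`,
    and characters are counted as Unicode code points via len(text).
    """
    if not text:
        raise ValueError("text must be non-empty")
    nll_nats = -sum(token_logprobs)    # total negative log-likelihood in nats
    nll_bits = nll_nats / math.log(2)  # convert nats to bits
    return nll_bits / len(text)        # normalize by the number of characters


def context_gain(bpc_without_context: float, bpc_with_context: float) -> float:
    """Reduction in bits per character when context is supplied.

    A positive value means the context made the target text more predictable.
    """
    return bpc_without_context - bpc_with_context


# Toy example: a two-character text split into two tokens,
# each assigned probability 0.5 by a hypothetical model.
toy_bpc = bits_per_character([math.log(0.5), math.log(0.5)], "hi")
```

In this toy case each token costs one bit and the text has two characters, so `toy_bpc` is 1 bit per character. Comparing values of this kind across the no, random, peer, and user context settings is one way to express which context helps most.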
