Limits to Predicting Online Speech Using Large Language Models

Mina Remeli, Moritz Hardt, Robert C. Williamson

TL;DR

The paper investigates how well state-of-the-art transformer-based language models can predict individual online speech on Twitter using negative log-likelihood and bits-per-character as uncertainty metrics. By collecting millions of tweets and evaluating four model families under no, random, peer, and user context, it finds that user context provides the strongest predictive signal, while predictions remain far from language-entropy limits; even with context, only the largest models approach the English entropy benchmark of about $1.12$ bits per character, and the predictive gain from peer context is consistently smaller. The study shows that context improves predictability largely by enabling models to learn common Twitter syntax such as hashtags and @-mentions, but tweet-tuning reduces reliance on in-context information and yields robust rankings across models and demographics. Across demographic proxies, results are robust, though predictive difficulty varies by group (e.g., Nigeria). The work informs debates on privacy risks in platform data and suggests limits to how much individual online speech can be predicted by current LLMs, guiding future research into the intersection of model uncertainty and real-world impact.

Abstract

Our paper studies the predictability of online speech, that is, how well language models learn to model the distribution of user-generated content on X (previously Twitter). We define predictability as a measure of the model's uncertainty, i.e., its negative log-likelihood. As the basis of our study, we collect 10M tweets for "tweet-tuning" base models and a further 6.25M posts from more than five thousand X (previously Twitter) users and their peers. In our study involving more than 5000 subjects, we find that predicting posts of individual users remains surprisingly hard. Moreover, it matters greatly what context is used: models using the users' own history significantly outperform models using posts from their social circle. We validate these results across four large language models ranging in size from 1.5 billion to 70 billion parameters. Moreover, our results replicate if, instead of prompting the model with additional context, we finetune on it. We follow up with a detailed investigation of what is learned in-context and a demographic analysis. Up to 20% of what is learned in-context is the use of @-mentions and hashtags. Our main results hold across the demographic groups we studied.

Paper Structure

This paper contains 53 sections, 5 equations, 24 figures, 2 tables.

Figures (24)

  • Figure 1: Predictability of a user's tweets using LLMs. Bits per character (BPC) measures, on average, how many bits are required to predict the next character. Predictability improves with additional context to the model: (i) past user tweets (user context, Fig. \ref{subfig:usr_ctxt}), (ii) past tweets from the user's peers (peer context), and (iii) past tweets from random users (control). We plot the average BPC over users in Fig. \ref{subfig:avg_BPC} and, for comparison, the estimated entropy rate of the English language from \citet{takahashi_cross_2018}. Most of the predictive information is found in the user context, followed by peer and random context. Our results are robust across models with different parameter sizes and tokenizers.
  • Figure 2: Our data collection process can be divided into two stages. In the first stage (Fig. \ref{subfig:firehose}), we collected 10M tweets in early 2023, which served as our base for sampling subjects. In the second stage (Fig. \ref{subfig:timeline}), we collected users' timelines.
  • Figure 3: Average effect size of context $c_2$ relative to $c_1$ on model uncertainty. Darker green means a greater reduction in model uncertainty (the model becomes less uncertain). For example, user context significantly reduces model uncertainty, by $3.1\sigma$ relative to having no context (top right corner). Model: Llama-2-70B.
  • Figure 4: Average improvement in NLL from additional user context (compared to none). The first few tokens of a tweet benefit most from the additional context. Model: Llama-2-70B.
  • Figure 5: Average model uncertainty on tweets without @-mentions and hashtags. Subjects become more predictable on average, and the positive effect of context on predictability decreases. Lighter bars are the results from our original experiment for comparison.
  • ...and 19 more figures
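
The bits-per-character metric described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the model reports per-token negative log-likelihoods in nats and that a character is one Unicode code point of the tweet text; the paper's exact normalization may differ. The function name `bits_per_character` and the example values are hypothetical and not taken from the paper.

```python
import math


def bits_per_character(token_nlls_nats, text):
    """Convert per-token NLLs (in nats) for `text` into bits per character.

    Assumption: characters are counted as Unicode code points (len(text)).
    """
    if not text:
        raise ValueError("text must be non-empty")
    # Summing token NLLs gives the NLL of the whole text; dividing by ln(2)
    # converts nats to bits.
    total_bits = sum(token_nlls_nats) / math.log(2)
    # Normalizing by characters rather than tokens makes the value
    # comparable across models with different tokenizers.
    return total_bits / len(text)


# Hypothetical example: a tweet split into three tokens by some tokenizer.
tweet = "hello world"
token_nlls = [2.0, 1.5, 3.0]  # made-up NLL values in nats
bpc = bits_per_character(token_nlls, tweet)
print(f"BPC = {bpc:.3f}")
```

Normalizing by character count instead of token count is what allows a comparison across tokenizers, as the caption notes for models of different sizes.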