Linear Representations of Sentiment in Large Language Models

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

TL;DR

Large language models encode sentiment along a linear direction in activation space, and this sentiment axis is causally relevant on both toy benchmarks and real-world data. The study applies causal interventions (activation patching, directional ablations, and distributed alignment search) and finds that a single sentiment direction generalizes best in intermediate layers, with negation and punctuation shaping its expression. A central finding is the summarization motif: sentiment information is stored at intermediate, non-valenced tokens (e.g., commas, periods, certain nouns), which act as an information bottleneck that measurably influences final predictions. Together, these results illuminate interpretable sentiment circuits in LLMs and provide a framework for probing internal representations and their implications for world-modeling and safety.

Abstract

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real-world datasets such as the Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.
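
To make "projection onto a direction" and "ablating the sentiment direction" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`unit`, `project`, `ablate_direction`), the toy activations, and the choice of a constant `baseline` are all hypothetical. The sketch replaces each activation's component along a unit direction with a fixed value (zero by default) and leaves everything orthogonal to that direction unchanged; the paper's experiments may use a different replacement value, such as a dataset mean.

```python
import numpy as np


def unit(v):
    """Return v rescaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return v / norm


def project(acts, direction):
    """Scalar projection of each activation row onto the (normalized) direction."""
    return np.asarray(acts, dtype=float) @ unit(direction)


def ablate_direction(acts, direction, baseline=0.0):
    """Set each row's component along `direction` to `baseline`.

    acts: array of shape (n_positions, d_model), hypothetical residual-stream rows.
    The part of each row orthogonal to `direction` is left untouched.
    """
    acts = np.asarray(acts, dtype=float)
    d = unit(direction)
    coeff = acts @ d  # current component along the direction, shape (n_positions,)
    return acts + np.outer(baseline - coeff, d)


# Toy example: two 3-dimensional "activations" and an assumed sentiment direction.
acts = np.array([[2.0, 1.0, 0.0],
                 [-3.0, 0.5, 4.0]])
direction = np.array([2.0, 0.0, 0.0])  # need not be unit length

before = project(acts, direction)
after = project(ablate_direction(acts, direction), direction)
print("projection before ablation:", before)
print("projection after ablation: ", after)
```

Restricting the ablation to particular token positions (for example, only commas) amounts to applying `ablate_direction` to the corresponding rows of `acts` and leaving the others as they are.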

Paper Structure (67 sections, 3 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Visual verification that a single direction captures sentiment across diverse contexts. Color represents the projection onto this direction: blue is positive and red is negative. Examples (\ref{fig:neuroscope-nouns}-\ref{fig:neuroscope-medical}) show the $K$-means sentiment direction for the first layer of GPT2-small on samples from OpenWebText. Example \ref{fig:neuroscope-french} shows the $K$-means sentiment direction for the 7th layer of pythia-1.4b on the opening of Harry Potter in French.
  • Figure 2: Cosine similarity of directions learned by different methods in GPT2-small's first layer. Each sentiment direction was derived from adjective representations in the ToyMovieReview dataset (Section \ref{section:datasets}).
  • Figure 3: Area plot of sentiment labels for OpenWebText samples by $K$-means sentiment activation (left). Accuracy using sentiment activations to classify tokens as positive or negative (right). The threshold taken is the top/bottom 0.1% of activations over OpenWebText. Sentiment activations are taken from GPT2-small's first residual stream layer. Classification was performed by GPT-4.
  • Figure 4: Results for different methods in pythia-1.4b. We report the best result found across layers. The columns show two evaluation datasets, ToyMovieReview and Treebank, and two evaluation metrics, mean logit difference and % of logit differences flipped.
  • Figure 5: We made a dataset of 27 negation examples and computed the change in sentiment activation at the negated token (e.g. doubt) between the 1st and 10th layers of GPT2-small. We show sample text across layers for $K$-means (left), and the fraction of activations flipped and the median size of the flip centered around the mean activation (right).
  • ...and 13 more figures