What is Sentiment Meant to Mean to Language Models?

Michael Burnham

TL;DR

Sentiment analysis with LLMs is complicated by the confounded nature of "sentiment," which blends emotion and opinion across domains. The author compares prompts for stance, valence, and sentiment across three LLMs on two datasets, using the Matthews Correlation Coefficient to assess alignment with human labels. Explicit stance prompts recover opinion labels most accurately, while sentiment prompts largely map onto emotional valence and underperform on stance; valence prompts perform best for valence labeling. The work argues for precise measurement constructs and, where feasible, prompts that target those dimensions, noting that smaller encoders can often achieve strong results without relying on the largest LLMs.

Abstract

Sentiment analysis is one of the most widely used techniques in text analysis. Recent advancements with Large Language Models have made it more accurate and accessible than ever, allowing researchers to classify text with only a plain English prompt. However, "sentiment" entails a wide variety of concepts depending on the domain and tools used. It has been used to mean emotion, opinions, market movements, or simply a general "good-bad" dimension. This raises a question: What exactly are language models doing when prompted to label documents by sentiment? This paper first overviews how sentiment is defined across different contexts, highlighting that it is a confounded measurement construct in that it entails multiple variables, such as emotional valence and opinion, without disentangling them. I then test three language models across two data sets with prompts requesting sentiment, valence, and stance classification. I find that sentiment labels most strongly correlate with valence labels. I further find that classification improves when researchers more precisely specify their dimension of interest rather than using the less well-defined concept of sentiment. I conclude by encouraging researchers to move beyond "sentiment" when feasible and use a more precise measurement construct.
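
The summary above names the Matthews Correlation Coefficient (MCC) as the measure of agreement between model labels and human labels. As a rough illustration only, the sketch below computes the multiclass form of MCC in plain Python. The function name, the label strings, and the toy data are assumptions made for this example; they are not taken from the paper, whose own evaluation code is not shown here.

```python
from collections import Counter
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient.

    Returns a value in [-1, 1]: 1 is perfect agreement, 0 is chance-level
    agreement. Returns 0.0 when the denominator is zero, for example when
    every prediction is the same class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    n = len(y_true)                                   # total documents
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)                     # times each class truly occurs
    pred_counts = Counter(y_pred)                     # times each class is predicted
    labels = set(true_counts) | set(pred_counts)

    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in labels)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in labels)

    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Hypothetical toy data: human stance labels vs. labels from one prompt.
human = ["favor", "against", "neutral", "favor", "against", "neutral"]
model = ["favor", "against", "neutral", "favor", "against", "favor"]
print(f"MCC = {matthews_corrcoef(human, model):.3f}")
```

Unlike raw accuracy, MCC accounts for how often each class occurs in both the true and predicted labels, so a model that assigns every document the same label scores 0 rather than the majority-class share.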

Paper Structure (6 sections, 4 figures)

This paper contains 6 sections, 4 figures.

Figures (4)

  • Figure 1: Classification performance with stance, sentiment, and valence prompts on stance-labeled documents. Using a prompt for the specific measurement of interest can provide a dramatic improvement over sentiment classification.
  • Figure 2: Probability of valence label given sentiment label across models. Higher values indicate greater agreement between valence and sentiment prompts. Valence-sentiment labels agree at a much higher rate than stance-sentiment labels.
  • Figure 3: Probability of stance label given sentiment label across models. Higher values indicate greater agreement between stance and sentiment prompts. Stance-sentiment disagreement exceeds valence-sentiment disagreement.
  • Figure 4: Classification performance on valence-labeled documents for sentiment and valence prompts. While LLMs generally understand sentiment to mean valence, classification performance improves across all models when using a prompt for valence classification.
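
Figures 2 and 3 report the probability of one label given the sentiment label. One plausible way to estimate such a quantity from paired model outputs is sketched below. This is an assumption about the computation, not the paper's code, and the function name, label strings, and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_label_probs(sentiment_labels, other_labels):
    """Estimate P(other label | sentiment label) from paired labels.

    Both lists hold one label per document, in the same document order.
    Returns a nested dict: probs[sentiment_label][other_label] -> probability.
    """
    if len(sentiment_labels) != len(other_labels):
        raise ValueError("label lists must have the same length")

    # Count how often each other-label co-occurs with each sentiment label.
    joint = defaultdict(Counter)
    for sent, other in zip(sentiment_labels, other_labels):
        joint[sent][other] += 1

    # Normalize each row so the probabilities for one sentiment label sum to 1.
    return {
        sent: {other: n / sum(counts.values()) for other, n in counts.items()}
        for sent, counts in joint.items()
    }


# Hypothetical outputs of a sentiment prompt and a valence prompt
# from the same model on the same four documents.
sentiment = ["positive", "positive", "negative", "negative"]
valence = ["positive", "negative", "negative", "negative"]

for sent, row in conditional_label_probs(sentiment, valence).items():
    print(sent, row)
```

Comparing stance labels against sentiment labels in the same way would first require mapping the two label sets onto a shared scale (for example, treating "favor" as the counterpart of "positive"); that mapping is itself an assumption.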