Table of Contents
Fetching ...

LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma

Jana Juroš, Laura Majer, Jan Šnajder

TL;DR

This paper compares the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages and analyzes the descriptive–prescriptive continuum.

Abstract

News headlines often evoke sentiment by intentionally portraying entities in particular ways, making targeted sentiment analysis (TSA) of headlines a worthwhile but difficult task. Due to its subjectivity, creating TSA datasets can involve various annotation paradigms, from descriptive to prescriptive, either encouraging or limiting subjectivity. LLMs are a good fit for TSA due to their broad linguistic and world knowledge and in-context learning abilities, yet their performance depends on prompt design. In this paper, we compare the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages. Exploring the descriptive--prescriptive continuum, we analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts. Finally, we evaluate the ability of LLMs to quantify uncertainty via calibration error and comparison to human label variation. We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness, yet the optimal level varies.

LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma

TL;DR

This paper compares the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages and analyzes the descriptive–prescriptive continuum.

Abstract

News headlines often evoke sentiment by intentionally portraying entities in particular ways, making targeted sentiment analysis (TSA) of headlines a worthwhile but difficult task. Due to its subjectivity, creating TSA datasets can involve various annotation paradigms, from descriptive to prescriptive, either encouraging or limiting subjectivity. LLMs are a good fit for TSA due to their broad linguistic and world knowledge and in-context learning abilities, yet their performance depends on prompt design. In this paper, we compare the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages. Exploring the descriptive--prescriptive continuum, we analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts. Finally, we evaluate the ability of LLMs to quantify uncertainty via calibration error and comparison to human label variation. We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness, yet the optimal level varies.
Paper Structure (28 sections, 1 equation, 3 figures, 14 tables)

This paper contains 28 sections, 1 equation, 3 figures, 14 tables.

Figures (3)

  • Figure 1: Average F1 scores and calibration accuracy for uncertainty quantification methods across levels (shaded ellipses indicate covariances)
  • Figure 2: F1 score per majority vote bins for annotators (X) and model (Y) for SCS vs. DP for NC and GPT-4
  • Figure 3: Comparison of F1 scores and calibration accuracy for various uncertainty quantification methods and across levels of prompt prescriptiveness (indicated by dot size). The gray lines indicate the Pareto front.