Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection

Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi

TL;DR

This work investigates how large language models can detect framing in news headlines without task-specific fine-tuning, addressing data-scarce social science scenarios. By evaluating GPT-4, GPT-3.5 Turbo, and FLAN-T5 across zero-shot, few-shot, and explainable prompting, the study reveals that explanations generally improve reliability, while GPT-4 excels in few-shot settings yet can misread emotional language as framing. A key finding is that cross-domain and mixed-domain prompts pose challenges, but consistent model agreement can help flag annotation inconsistencies in datasets like GVFC; the authors also introduce an in-the-wild (ITW) dataset to test real-world applicability. Fine-tuning FLAN-T5 improves in-domain performance but reduces generalizability to broader topics, underscoring the value of prompt-based methods in diverse real-world contexts. Overall, the results provide actionable guidance for researchers applying LLMs to framing detection, including when to use explainable prompts and how to balance model choice with domain considerations, dataset reliability, and evaluation in the wild.

Abstract

Previous studies on framing have relied on manual analysis or fine-tuning models with limited annotated datasets. However, pre-trained models, with their diverse training backgrounds, offer a promising alternative. This paper presents a comprehensive analysis of GPT-4, GPT-3.5 Turbo, and FLAN-T5 models in detecting framing in news headlines. We evaluated these models in various scenarios: zero-shot, few-shot with in-domain examples, cross-domain examples, and settings where models explain their predictions. Our results show that explainable predictions lead to more reliable outcomes. GPT-4 performed exceptionally well in few-shot settings but often misinterpreted emotional language as framing, highlighting a significant challenge. Additionally, the results suggest that consistent predictions across multiple models could help identify potential annotation inaccuracies in datasets. Finally, we propose a new small dataset for real-world evaluation on headlines from a diverse set of topics.

Paper Structure (35 sections, 1 figure, 13 tables)