Pitfalls of Conversational LLMs on News Debiasing

Ipek Baris Schlicht; Defne Altiok; Maryanne Taouk; Lucie Flek

TL;DR

The paper examines debiasing in news editing and evaluates whether conversational LLMs can reliably produce unbiased yet faithful news text. It introduces a domain-specific editorial checklist, uses a bias-labeled subset of the BABE dataset, and compares three popular LLMs (ChatGPT, GPT-4, Llama2-70b-chat) against a T5 baseline with LoRA, with expert editors providing ground-truth assessments. The findings show that, although the LLMs reduce bias and improve grammar, they frequently alter information, context, or the author's writing style, and they sometimes hallucinate. When the LLMs are used as evaluators of debiased outputs, their judgments diverge from those of the experts. The work highlights safety and reliability concerns for fully automatic news debiasing and calls for benchmark datasets and human-in-the-loop approaches to enable more robust deployment.

Abstract

This paper addresses debiasing in news editing and evaluates the effectiveness of conversational Large Language Models in this task. We designed an evaluation checklist tailored to news editors' perspectives, obtained generated texts from three popular conversational models using a subset of a publicly available media bias dataset, and evaluated the texts according to the checklist. Furthermore, we examined the models as evaluators of the quality of debiased model outputs. Our findings indicate that none of the LLMs is perfect at debiasing. Notably, some models, including ChatGPT, introduced unnecessary changes that may alter the author's style and create misinformation. Lastly, we show that the models do not perform as proficiently as domain experts in evaluating the quality of debiased outputs.
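
The comparison between model evaluators and domain experts can be illustrated with a small agreement computation. The sketch below is not the authors' evaluation code: the criterion names, the binary pass/fail labels, and the choice of Cohen's kappa as the agreement measure are all assumptions made for illustration, since this summary does not state which measure the paper uses.

```python
from collections import Counter


def cohen_kappa(expert, model):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(expert) != len(model) or not expert:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(expert)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(e == m for e, m in zip(expert, model)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    expert_counts = Counter(expert)
    model_counts = Counter(model)
    expected = sum(
        expert_counts[label] * model_counts[label] for label in expert_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_criterion(expert_labels, model_labels):
    """Kappa per checklist criterion; both inputs map criterion -> labels."""
    return {
        criterion: cohen_kappa(expert_labels[criterion], model_labels[criterion])
        for criterion in expert_labels
    }


# Hypothetical checklist criteria and made-up labels (1 = pass, 0 = fail).
expert_labels = {
    "bias_removed": [1, 1, 0, 0],
    "context_preserved": [1, 0, 1, 0],
}
model_labels = {
    "bias_removed": [1, 0, 0, 0],
    "context_preserved": [1, 0, 1, 0],
}
scores = agreement_by_criterion(expert_labels, model_labels)
```

A per-criterion breakdown of this kind makes it visible where a model evaluator departs from the experts, for example on context preservation rather than on bias removal, which a single pooled score would hide.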

Paper Structure (7 sections, 2 figures, 3 tables)

This paper contains 7 sections, 2 figures, 3 tables.

Introduction
Related Works
Methodology
News Editorial Criteria
Debiasing Models
Results
Conclusion

Figures (2)

Figure 1: A biased text in which the phrase "blind-sided" introduces bias by conveying a strong negative opinion about the actions of state and local officials, together with its GPT-4 debiased version, which contains no toxicity according to the Perspective API. Debiasing changed the facts and the context (the factually incorrect statement is highlighted in red, the original version in blue).
Figure 2: Prompts for debiasing and evaluation. The full version of the evaluator prompt can be found in our source code.
