The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale

Samar Haider; Amir Tohidi; Jenny S. Wang; Timothy Dörr; David M. Rothschild; Chris Callison-Burch; Duncan J. Watts

TL;DR

The paper introduces the Media Bias Detector, a scalable near-real-time framework that combines broad homepage scraping with an LLM-driven labeling pipeline to annotate topics, lean, tone, sentence type, quotes, and events across publishers. It provides a curated, continuously growing dataset (over 150,000 articles in 2024 from 10 initial publishers, now expanding to 21) and an interactive dashboard enabling multi-level analyses of selection and framing bias. Validation combines automated labeling with systematic human oversight, reporting high accuracy in topic and tone classification and robust event clustering, and revealing patterns such as horse-race emphasis, topic-dependent lean, and headline–article incongruities. The work discusses limitations like data coverage, paywalls, and model dependence while outlining future extensions to additional media formats and broader time horizons to sustain empirical insights on media bias.

Abstract

Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near-real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel dataset can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.
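
To make the multi-level idea concrete, the following is a minimal sketch of how sentence-level labels could be rolled up to the article level and then to the publisher level. It is not the paper's pipeline: the `Sentence` record, the numeric `tone` score in [-1, 1], and the use of a simple mean at each level are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sentence:
    """Hypothetical sentence-level annotation (field names are assumptions)."""
    publisher: str
    article_id: str
    tone: float  # assumed numeric tone score in [-1, 1]


def article_tone(sentences):
    """Average sentence tone within each (publisher, article) pair."""
    by_article = defaultdict(list)
    for s in sentences:
        by_article[(s.publisher, s.article_id)].append(s.tone)
    return {key: mean(scores) for key, scores in by_article.items()}


def publisher_tone(article_scores):
    """Average article-level tone within each publisher."""
    by_publisher = defaultdict(list)
    for (publisher, _article_id), score in article_scores.items():
        by_publisher[publisher].append(score)
    return {pub: mean(scores) for pub, scores in by_publisher.items()}
```

Averaging articles rather than pooling all sentences keeps long articles from dominating a publisher's score; other weightings would be equally reasonable under different assumptions.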
