The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale

Samar Haider, Amir Tohidi, Jenny S. Wang, Timothy Dörr, David M. Rothschild, Chris Callison-Burch, Duncan J. Watts

TL;DR

The paper introduces the Media Bias Detector, a scalable near-real-time framework that combines broad homepage scraping with an LLM-driven labeling pipeline to annotate topics, lean, tone, sentence type, quotes, and events across publishers. It provides a curated, continuously growing dataset (over 150,000 articles in 2024 from 10 initial publishers, now expanding to 21) and an interactive dashboard enabling multi-level analyses of selection and framing bias. Validation combines automated labeling with systematic human oversight, reporting high accuracy in topic and tone classification and robust event clustering, and revealing patterns such as horse-race emphasis, topic-dependent lean, and headline–article incongruities. The work discusses limitations like data coverage, paywalls, and model dependence while outlining future extensions to additional media formats and broader time horizons to sustain empirical insights on media bias.

Abstract

Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near-real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel dataset can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.

Paper Structure

This paper contains 30 sections, 34 figures, 2 tables.

Figures (34)

  • Figure 1: Top: The core guiding principles behind the design of our framework. We analyze each article individually to allow direct, data-driven comparisons between publishers regarding their selection and framing of news topics. By extracting simple labels using state-of-the-art LLMs, we ensure a high level of trust in our dataset. As we have a variety of data points for each article, our framework allows users to freely combine them to gain new insights into the media. Bottom: The two primary views of our dashboard: coverage (left) and events (right). The coverage page gives users a long-term view of what topics the media chooses to cover. Users can set filters to narrow the time frame and click through on each news topic to compare the subtopic-level distribution. They can also choose to color the stacked bars by the tone or political lean reflected in the coverage. The event view, on the other hand, offers a more fast-paced view of the daily news cycle and shows the top news events of the past day along with the amount of attention given to them by each publisher. Users can click on an event to view the top facts within it as well as trace their origin to the original articles themselves.
  • Figure 2: The Media Bias Detector framework. For each of the ten publishers, we take a snapshot of their homepage five times per day and scrape the top-ranking news articles. We then use LLMs to extract multiple levels of structured labels from them at the article level (topic, subtopic, lean, tone, type) and the sentence level (type, tone, focus). In addition, we use OpenAI's text embedding model to obtain document embeddings for each article, which are used for event clustering. We then generate embeddings for every sentence in a news event and cluster them as well to extract the top facts about it. Our regular human-in-the-loop process adds oversight to this framework and ensures that the generated labels are accurate.
  • Figure 3: Top: A list of the data points we extract from each article. The data in black (1-5) are obtained automatically during the scraping process, and the ones in color (6-16) are extracted using LLMs. All of these data points can be combined in various ways to answer a multitude of research questions about the media. Bottom: An example of GPT-4o's article lean and tone analysis and labels for the news article. The analyses generated by GPT-4o showcase its deep understanding of the subject matter of the article, how it connects to the political landscape, and how its framing can be read as support for one party or the other.
  • Figure 4: Left: Total number of unique articles in our dataset from each publisher in 2024, compared to their current monthly traffic for May 2025 [pressgazette_topnewswebsites]. Right: The most popular news topics across this dataset, colored by category (shown in the inset). As expected in an election year, most of the news was dominated by politics, which comprises more than half of all articles in our dataset.
  • Figure 5: The imbalance between policy and election horse race coverage. The media primarily covered the election as a horse race instead of focusing on more substantive policy discussions to inform their readers of each candidate's and their party's agenda. Over 8,000 articles were published about the horse race across all publishers in 2024, compared to only 3,000 about all the policy topics put together (a significant fraction of which come from Breitbart's focus on immigration).
  • ...and 29 more figures
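
The event-clustering step described in the Figure 2 caption (document embeddings grouped into news events) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the captions above do not say which clustering algorithm the authors use, so a simple greedy cosine-similarity grouping is swapped in; the names `cosine` and `cluster_events`, the `threshold` value, and the three-dimensional toy vectors (standing in for real embedding-model output) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_events(embeddings, threshold=0.8):
    """Greedy grouping: assign each article to the most similar existing
    cluster if similarity >= threshold, otherwise start a new cluster.

    Illustrative stand-in only, not the paper's actual algorithm.
    Returns a list of clusters, each a list of article indices.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for idx, vec in enumerate(embeddings):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": list(vec), "members": [idx]})
        else:
            # Update the running mean of the cluster's member vectors.
            n = len(best["members"])
            best["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(best["centroid"], vec)
            ]
            best["members"].append(idx)
    return [cluster["members"] for cluster in clusters]


# Toy vectors standing in for article embeddings (hypothetical data):
# articles 0 and 1 point in one direction, articles 2 and 3 in another.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.95, 0.05],
]
groups = cluster_events(toy_embeddings)
print(groups)
```

Under the same assumptions, the function could be applied a second time to the sentence embeddings within one event, mirroring the caption's description of extracting the top facts about an event.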