Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites

Hans W. A. Hanley, Zakir Durumeric

TL;DR

This study measures the prevalence and dynamics of machine-generated news articles across a large mix of mainstream and misinformation websites, using a DeBERTa-based detector trained with perturbation and paraphrase augmentations. It analyzes 15.46 million articles from 3,074 sites (January 2022 to May 2023) and finds substantial growth in synthetic content, especially on misinformation sites and smaller outlets, with a noticeable spike following ChatGPT's release. The work also examines topic distributions and Reddit engagement to contextualize impact, and it demonstrates the detector's robustness against various adversarial generations while acknowledging limitations and the need for ongoing monitoring as LLMs evolve. Overall, the findings highlight rapid adoption of machine-generated content in the news ecosystem and underscore the importance of scalable, generalizable detection to mitigate the spread of misinformation.

Abstract

As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun using them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites, but disreputable news sites can also use LLMs to mass-produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.46 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 57.3% on mainstream websites while increasing by 474% on misinformation sites. We find that this increase is largely driven by smaller, less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted time series analysis, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites.
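
The abstract refers to an interrupted time series analysis around ChatGPT's release. The sketch below illustrates the general technique with a segmented linear regression: a pre-release trend, plus a level change and a slope change after a breakpoint. It is not the authors' specification, which this section does not give. The function name `fit_its`, the breakpoint index, and the noise-free toy series are all hypothetical and carry no data from the study.

```python
import numpy as np

def fit_its(y, t0):
    """Segmented regression for an interrupted time series (illustrative only).

    Model: y_t = b0 + b1*t + b2*post_t + b3*(t - t0)*post_t,
    where post_t = 1 for t >= t0 and 0 otherwise.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= t0).astype(float)
    # Design matrix: intercept, pre-existing trend, level change, slope change.
    X = np.column_stack([np.ones_like(t), t, post, (t - t0) * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["intercept", "pre_slope", "level_change", "slope_change"]
    return dict(zip(names, coef))

# Hypothetical toy series: a steady trend, then a jump and a steeper slope at index 12.
t = np.arange(20)
y = 1.0 + 0.1 * t + np.where(t >= 12, 0.5 + 0.3 * (t - 12), 0.0)

fit = fit_its(y, t0=12)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

In this framing, `level_change` captures an immediate jump at the breakpoint and `slope_change` captures a change in the growth rate afterwards. A real analysis would also need to account for noise and autocorrelation, which this sketch ignores.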

Paper Structure (9 sections, 8 figures, 11 tables)

Figures (8)

  • Figure 1: The average percentage of synthetic articles for all, misinformation, and mainstream websites. We provide 95% Normal confidence intervals.
  • Figure 2: Example first paragraph of an article classified by our system as machine-generated/synthetic.
  • Figure 3: The number of websites that published at least one synthetic article over a 30-day time span.
  • Figure 4: The number of articles that contained a common ChatGPT error message over time.
  • Figure 5: The average percentage of machine-generated/synthetic articles for misinformation/unreliable and mainstream/reliable news websites at different striations of popularity according to the Google Chrome User Experience Report (CrUX) from October 2022. All striations of misinformation websites experienced a small uptick of machine-generated content around November 30, 2022, the release date of OpenAI's ChatGPT. We note that the scale of synthetic content is much larger for websites with popularity rank $>$10M.
  • ...and 3 more figures
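
The caption of Figure 1 mentions 95% Normal confidence intervals around an average percentage. As a minimal sketch of how such an interval is commonly computed (mean plus or minus 1.96 standard errors), and assuming the average is taken over per-website percentages, one might write the following. The helper name `normal_ci95` and the sample values are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev

def normal_ci95(values):
    """Return (low, high) for a 95% Normal-approximation CI of the mean."""
    n = len(values)
    m = mean(values)
    # Standard error of the mean, using the sample standard deviation.
    half_width = 1.96 * stdev(values) / math.sqrt(n)
    return m - half_width, m + half_width

# Hypothetical per-website synthetic-article percentages (illustrative only).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
low, high = normal_ci95(sample)
print(f"mean={mean(sample):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The Normal approximation is reasonable when many websites contribute to each average; with very few sites, a t-based interval would be wider.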