Can Large Language Models Detect Misinformation in Scientific News Reporting?

Yupeng Cao, Aishwarya Muralidharan Nair, Nastaran Jamalipour Soofi, Elyon Eyimife, K. P. Subbalakshmi

TL;DR

This work targets misinformation in scientific news by introducing the CoSMis (SciNews) dataset, composed of 2,400 news articles (balanced between reliable and unreliable, human-written and LLM-generated) paired with related CORD-19 abstracts. It defines Dimensions of Scientific Validity (DoV) and proposes three LLM-based architectures (SERIf, SIf, D2I) to detect misrepresentation without requiring explicit claim generation, using zero-shot, few-shot, and DoV-guided chain-of-thought prompting. Across GPT-3.5/4 and various LLaMA models, the SIf architecture with DoV-CoT prompting achieves the strongest overall performance, while LLM-generated misinformation remains more challenging to detect than human-written content; DoV prompting also provides interpretable explanations via DoV scores. The study demonstrates that large language models can identify scientific misinformation in the wild without extensive training, and can generate rationales for their judgments, suggesting practical avenues for scalable, explainable misinformation detection in science communication.

Abstract

Scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the COVID-19 pandemic. Automatic detection of misinformation in the scientific domain is challenging because scientific papers and popular press articles are written in distinct styles, and the field is still in its nascence. Most research on the validity of scientific reporting treats this problem as a claim verification challenge, which requires significant expert human effort to generate appropriate claims. Our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. The central research question of this paper is whether it is possible to use large language models (LLMs) to detect misinformation in scientific reporting. To this end, we first present a new labeled dataset, SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the CORD-19 database. Our dataset includes both human-written and LLM-generated news articles, so it also captures the growing trend of using LLMs to generate popular press articles. Then, we identify dimensions of scientific validity in science news articles and explore how these can be integrated into the automated detection of scientific misinformation. We propose several baseline architectures using LLMs to automatically detect false representations of scientific findings in the popular press. For each of these architectures, we use several prompt engineering strategies, including zero-shot, few-shot, and chain-of-thought prompting. We test these architectures and prompting strategies on GPT-3.5, GPT-4, Llama2-7B, and Llama2-13B.

Paper Structure (45 sections, 9 figures, 3 tables)

This paper contains 45 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The dataset construction process: ① collecting human-written scientific news related to COVID-19 from publicly available datasets and web resources, ② selecting abstracts from CORD-19 as resources to guide LLMs to generate articles using a jailbreak prompt, ③ augmenting the dataset with an evidence corpus drawn from CORD-19.
  • Figure 2: Schematic of the designed jailbreak prompt.
  • Figure 3: Proposed Architectures. SERIf includes all three modules: Summarization, Sentence-level Evidence Retrieval, and Inference Module. SIf bypasses the evidence retrieval module while keeping the other two. D2I removes both the summarization and the explicit evidence retrieval module.
  • Figure 4: Comparison of two spider plot visualizations: The left side corresponds to the 'Unreliable' case, while the right side corresponds to the 'Reliable' case. By visualizing the 'axis of scientific validity,' we can clearly observe the process of the LLM applying DoV to evaluate scientific news and the resulting differences.
  • Figure 5:
  • ...and 4 more figures
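
The Figure 3 caption describes the three architectures in terms of which modules run before inference. The relationship can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the callable signatures, and the choice to summarize only the news article are all assumptions, and each callable stands in for an LLM prompt whose actual wording is not given here.

```python
from typing import Callable, List

# Hypothetical module signatures (assumptions; the source does not specify them).
Summarizer = Callable[[str], str]            # news article -> summary
Retriever = Callable[[str, str], List[str]]  # (summary, abstract) -> evidence sentences
Inferrer = Callable[[str, str], str]         # (news text, evidence text) -> judgment


def serif(article: str, abstract: str,
          summarize: Summarizer, retrieve: Retriever, infer: Inferrer) -> str:
    """SERIf: Summarization, Sentence-level Evidence Retrieval, then Inference."""
    summary = summarize(article)
    evidence = retrieve(summary, abstract)
    return infer(summary, " ".join(evidence))


def sif(article: str, abstract: str,
        summarize: Summarizer, infer: Inferrer) -> str:
    """SIf: skips evidence retrieval; the abstract is passed to inference as-is."""
    return infer(summarize(article), abstract)


def d2i(article: str, abstract: str, infer: Inferrer) -> str:
    """D2I: no summarization and no explicit retrieval; inference sees raw inputs."""
    return infer(article, abstract)
```

Passing the modules in as callables keeps the sketch independent of any particular model or prompting strategy, so the same three functions could wrap zero-shot, few-shot, or chain-of-thought prompts.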