DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection

Herun Wan; Shangbin Feng; Zhaoxuan Tan; Heng Wang; Yulia Tsvetkov; Minnan Luo

TL;DR

DELL integrates LLMs into misinformation detection at three stages: generating diverse user reactions that ground news articles, producing explanations for proxy tasks to enrich article context, and using an LLM-guided selective ensemble of task-specific experts for calibrated predictions. The approach is validated on seven datasets with three LLMs, achieving state-of-the-art macro F1-scores with gains of up to 16.8% and improved calibration. Key findings show that LLM-generated reactions ground articles effectively, proxy-task explanations enrich representations for better detection, and merging experts with confidence signals yields well-calibrated decisions. Overall, DELL demonstrates that carefully structured LLM integration (grounded reactions, explainable proxy tasks, and selective ensembling) can deliver robust and scalable misinformation detectors.
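For readers less familiar with the headline metric, macro F1 is the unweighted mean of the per-class F1 scores, so a minority class (often the misinformation class) counts as much as the majority class. The following is a minimal sketch with made-up labels; the function name `macro_f1` and the example data are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only (not data from the paper).
y_true = ["real", "real", "real", "fake"]
y_pred = ["real", "real", "fake", "fake"]
score = macro_f1(y_true, y_pred)  # per-class F1: fake = 2/3, real = 4/5
```

Because each class contributes equally, a detector that ignores the rarer class is penalized even when overall accuracy looks high.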

Abstract

Large language models are too limited by challenges in factuality and hallucination to be employed off-the-shelf for judging the veracity of news articles, where factual accuracy is paramount. In this work, we propose DELL, which identifies three key stages in misinformation detection where LLMs could be incorporated as part of the pipeline: 1) LLMs could generate news reactions to represent diverse perspectives and simulate user-news interaction networks; 2) LLMs could generate explanations for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) LLMs could merge task-specific experts and provide an overall prediction by incorporating the predictions and confidence scores of the various experts. Extensive experiments on seven datasets with three LLMs demonstrate that DELL outperforms state-of-the-art baselines by up to 16.8% in macro F1-score. Further analysis reveals that the generated reactions and explanations are greatly helpful in misinformation detection, while our proposed LLM-guided expert merging helps produce better-calibrated predictions.
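The third stage, in which an LLM merges task-specific experts using their predictions and confidence scores, can be pictured as a prompt-construction step. The sketch below is only an illustration under stated assumptions: the `ExpertOutput` type, the `build_merge_prompt` function, the prompt wording, and the "real"/"fake" labels are all hypothetical and are not the authors' implementation. It does not model how experts are selected.

```python
from dataclasses import dataclass


@dataclass
class ExpertOutput:
    task: str          # proxy task the expert specializes in, e.g. "sentiment"
    label: str         # the expert's predicted label (hypothetical label set)
    confidence: float  # the expert's confidence in its prediction, in [0, 1]


def build_merge_prompt(article, experts, include_confidence=True):
    """Assemble a text prompt listing each expert's prediction for an LLM to merge."""
    lines = ["News article:", article.strip(), "", "Expert predictions:"]
    for expert in experts:
        line = f"- Expert ({expert.task}): {expert.label}"
        if include_confidence:
            # Confidence is appended so the LLM can weigh unsure experts less.
            line += f" (confidence {expert.confidence:.2f})"
        lines.append(line)
    lines += ["", "Based on the experts above, is this article real or fake?"]
    return "\n".join(lines)


# Hypothetical expert outputs for illustration only.
experts = [
    ExpertOutput(task="sentiment", label="fake", confidence=0.91),
    ExpertOutput(task="stance", label="real", confidence=0.55),
]
prompt = build_merge_prompt("Example article text.", experts)
```

Passing `include_confidence=False` yields a prompt with predictions only, which makes it easy to compare merging with and without confidence signals.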

Paper Structure (44 sections, 2 equations, 10 figures, 16 tables, 1 algorithm)

Introduction
Methodology
  Diverse Reaction Generation
    Diverse User Attribute
    Generating User-News Networks
  Explainable Proxy Tasks
  LLM-Based Expert Ensemble
    Vanilla
    Confidence
    Selective
Experiment Settings
  Models and Settings
  Baselines
  Tasks and Datasets
Results
...and 29 more sections

Figures (10)

Figure 1: Overview of DELL. We first employ LLMs to generate news reactions from diverse perspectives and form user-news interaction networks. We then design six explainable proxy tasks to refine the feature embeddings with LLM-generated explanations. We finally propose three LLM-based strategies to selectively merge the predictions of task-specific experts and enhance calibration.
Figure 2: GPT-4 evaluation of whether the LLM-generated comments are related to the news article and match the user attributes, scored from 1 to 5 (higher is better). We present the mean and standard deviation. Compared with randomly paired news articles, user attributes, and comments ("Random" in the figure), the generated comments generally conform to the user attributes and are relevant to the news articles.
Figure 3: Performance of DELL and baselines on LLM-mis when the comments are gradually removed. DELL shows great robustness to the availability of comments.
Figure 4: GPT-4 evaluation of the matching degree between different user groups. "CG" denotes "college grad", "non-CG" denotes "haven't graduated from college", and "HSD" denotes "have a high school diploma or less". The diagonal numbers are the highest both row-wise and column-wise, indicating that the generated comments are consistent with the user attributes.
Figure 5: Performance of DELL and baselines when the comments are generated from only one partisan perspective. Models trained on comments from one perspective generally perform worse than models trained on diverse comments.
...and 5 more figures
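
The Figure 4 caption reads a score matrix by checking that every diagonal entry is the largest value in both its row and its column. A minimal sketch of that check follows; the function name `diagonal_is_row_and_column_max` and the example scores are illustrative assumptions, not values from the paper.

```python
def diagonal_is_row_and_column_max(matrix):
    """Return True if each diagonal entry is the maximum of its row and its column.

    Assumes a square matrix given as a list of equal-length lists.
    """
    n = len(matrix)
    for i in range(n):
        row = matrix[i]
        column = [matrix[r][i] for r in range(n)]
        if matrix[i][i] < max(row) or matrix[i][i] < max(column):
            return False
    return True


# Made-up matching scores on a 1-to-5 scale: rows are the attribute used to
# generate a comment, columns are the attribute it is evaluated against.
scores = [
    [4.6, 2.1, 1.8],
    [2.3, 4.2, 2.9],
    [1.7, 3.0, 4.4],
]
consistent = diagonal_is_row_and_column_max(scores)
```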
