On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs
Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao
TL;DR
This work investigates the risk that LLM-generated evidence pollution misleads detectors that rely on external evidence to identify malicious social text. It formalizes a pollution taxonomy with three families (basic, rephrase, and generate) and evaluates thirteen manipulation methods along with three defense strategies: machine-generated text detection, mixture of experts, and parameter updating. Across four malicious-text detection tasks and ten datasets, pollution degrades detector performance, and LLM-generated evidence is the most effective at confusing predictions. The defenses mitigate some of the damage but carry practical constraints such as data requirements and computational costs. Further analysis shows that polluted evidence can be of high quality, harms model calibration, and does more damage when pollution strategies are combined, especially for encoder-based LMs, underscoring the need for robust defenses in evidence-enhanced detection systems.
Abstract
Evidence-enhanced detectors show remarkable ability in identifying malicious social text. However, the rise of large language models (LLMs) brings the risk that polluted evidence will confuse these detectors. This paper explores potential manipulation scenarios, including basic pollution and the rephrasing or generation of evidence by LLMs. To mitigate the negative impact, we propose three defense strategies from the data and model sides: machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets show that evidence pollution significantly compromises detectors, with the generating strategy causing up to a 14.4% performance drop. The defense strategies can mitigate evidence pollution, but they face limitations in practical deployment. Further analysis shows that polluted evidence (i) is of high quality, as evaluated by metrics and humans; (ii) compromises model calibration, increasing expected calibration error up to 21.6%; and (iii) can be combined to amplify the negative impact, especially for encoder-based LMs, where the accuracy drops by 21.8%.
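
For readers unfamiliar with the calibration metric mentioned above, the following is a minimal sketch of the commonly used binned form of expected calibration error (ECE). It is an illustration only: the function name, the equal-width binning, and the default of 10 bins are assumptions, and the abstract does not state which ECE configuration the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted-class probabilities in [0, 1]
    correct:     1/True if the prediction was right, 0/False otherwise
    n_bins:      number of equal-width bins (an assumed default, not from the paper)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Each bin stores [sample count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(bool(ok))

    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = n_correct / count
        mean_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a detector that is 90% confident on every prediction but correct only 75% of the time has an ECE of about 0.15, which is the kind of gap between confidence and accuracy that the calibration analysis measures.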
