On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs

Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao

TL;DR

This work investigates the risk that evidence pollution generated by LLMs can mislead detectors that rely on external evidence to identify malicious social text. It formalizes a pollution taxonomy with three families—basic, rephrase, and generate—and evaluates thirteen manipulation methods plus three defense strategies: machine-generated text detection, mixture of experts, and parameter updating. Across four malicious-text tasks and ten datasets, pollution degrades detector performance, with LLM-generated evidence proving the most effective at confusing predictions; however, defenses can mitigate some effects, albeit with practical constraints such as data requirements and computational costs. The analysis reveals that polluted evidence can be high-quality and harm model calibration, and when combined, pollution strategies can amplify the damage, especially for encoder-based LMs, underscoring the need for robust defenses in evidence-enhanced detection systems.

Abstract

Evidence-enhanced detectors show remarkable ability in identifying malicious social text. However, the rise of large language models (LLMs) brings the risk of evidence pollution that confuses these detectors. This paper explores potential manipulation scenarios, including basic pollution and the rephrasing or generation of evidence by LLMs. To mitigate the negative impact, we propose three defense strategies from the data and model sides: machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets illustrate that evidence pollution significantly compromises detectors, with the generating strategy causing up to a 14.4% performance drop. The defense strategies can mitigate evidence pollution, but they face limitations in practical deployment. Further analysis illustrates that polluted evidence (i) is of high quality, as evaluated by metrics and humans; (ii) compromises model calibration, increasing expected calibration error by up to 21.6%; and (iii) can be combined to amplify the negative impact, especially for encoder-based LMs, where accuracy drops by 21.8%.

Paper Structure

This paper contains 67 sections, 6 equations, 16 figures, 15 tables.

Figures (16)

  • Figure 1: An overview of Evidence Pollution, which illustrates the potential risk posed by LLMs. Malicious actors could use LLMs to manipulate the evidence and confuse evidence-enhanced malicious social text detectors.
  • Figure 2: Out-of-domain machine-generated text detection performance of DeBERTa. DeBERTa struggles to conduct out-of-domain detection. Values in the red box show that DeBERTa generalizes worse on different types of evidence manipulation datasets.
  • Figure 3: Performance trend of the Parameter Updating strategy as the amount of re-training data increases. In some situations, this strategy can significantly improve detection performance. However, it may fail against Basic pollution, such as Reverse, or for models that are already well trained, such as GET. In addition, the need for annotated data and the lack of a clear point at which to stop training limit its practical application.
  • Figure 4: Calibration of existing detectors with the original and polluted evidence. ECE denotes expected calibration error, the lower the better. The dashed line indicates perfect calibration, while the color of the bar is darker when it is closer to perfect calibration. Evidence pollution could harm the model calibration.
  • Figure 5: Evaluation of the manipulated evidence. We evaluate the relevance between social text and corresponding evidence and the semantic-level and word-level similarity between original and rephrased evidence. The polluted evidence is of high quality.
  • ...and 11 more figures
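
Figure 4 and the abstract report expected calibration error (ECE). As a rough illustration of what that metric measures, the sketch below computes the common equal-width binned ECE: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the share of samples in that bin. This is not taken from the paper; the function name, the default of 10 bins, and the toy inputs are assumptions, and the authors' exact binning may differ.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (illustrative sketch, not the paper's code).

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; digitize with right=True gives bins of the form (lo, hi].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - mean confidence| in this bin, weighted by bin share.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


if __name__ == "__main__":
    # Toy example (made-up values): 90% confidence but only 75% accuracy.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means confidence tracks accuracy more closely; a detector whose confidence stays high while polluted evidence lowers its accuracy would see this value rise.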