Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
Jinwen Chen, Hainan Zhang, Fei Sun, Qinnan Zhang, Sijia Wen, Ziwei Wang, Zhiming Zheng
TL;DR
This work tackles stealthy data poisoning of LLMs during fine-tuning by introducing RFTC, a two-stage detector that combines Reference-Filtration with TF-IDF clustering. The filtration stage uses a reference model to flag suspicious responses via a BLEU-based $P_n$ similarity, which concentrates poisoned samples in the suspicious set. The flagged responses are then clustered on TF-IDF features, and intra-class distance is used to separate true backdoor samples from clean data. Empirical results on machine translation (MT) and question answering (QA) tasks show that RFTC achieves near-perfect true positive rates with zero false positives, while preserving downstream generation quality and reducing computational cost relative to baselines. The approach is robust to the choice of reference model and adapts to several trigger types (word, combination, syntactic), offering a practical defense for cleansing training data before or during LLM fine-tuning.
Abstract
Stealthy data poisoning during fine-tuning can backdoor large language models (LLMs), threatening downstream safety. Existing detectors either use classifier-style probability signals, which are ill-suited to generation, or rely on rewriting, which can degrade quality and even introduce new triggers. We address the practical need to efficiently remove poisoned examples before or during fine-tuning. We observe a robust signal in the response space: after applying TF-IDF to model responses, poisoned examples form compact clusters (driven by consistent malicious outputs), while clean examples remain dispersed. We leverage this with RFTC (Reference-Filtration + TF-IDF Clustering). RFTC first compares each example's response with that of a reference model and flags those with large deviations as suspicious; it then performs TF-IDF clustering on the suspicious set and identifies true poisoned examples using intra-class distance. On two machine translation datasets and one QA dataset, RFTC outperforms prior detectors in both detection accuracy and the downstream performance of the fine-tuned models. Ablations with different reference models further validate the effectiveness and robustness of Reference-Filtration.
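
The two stages described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the clipped unigram precision stands in for the BLEU-based similarity, the greedy seed-based grouping stands in for whatever clustering algorithm the paper uses, and all function names, thresholds, and toy sentences are hypothetical choices made only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of candidate vs. reference (assumed stand-in for the BLEU-based score)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / total


def filter_suspicious(responses, reference_outputs, threshold=0.5):
    """Stage 1: indices of responses that deviate strongly from the reference model's output."""
    return [i for i, (r, ref) in enumerate(zip(responses, reference_outputs))
            if ngram_precision(r, ref) < threshold]


def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF, computed over the given texts only."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # document frequency of each word
    return [{w: (c / sum(d.values())) * (math.log((1 + n) / (1 + df[w])) + 1)
             for w, c in d.items()} for d in docs]


def cosine_distance(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)


def mean_intra_distance(vecs):
    """Average pairwise cosine distance inside one cluster."""
    pairs = list(combinations(vecs, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def detect_poisoned(responses, reference_outputs, join_dist=0.5, compact_dist=0.5):
    """Stage 2: cluster the suspicious responses and flag compact clusters as poisoned."""
    suspicious = filter_suspicious(responses, reference_outputs)
    vecs = tfidf_vectors([responses[i] for i in suspicious])
    clusters = []  # each cluster holds positions into `suspicious`; first member is the seed
    for pos, v in enumerate(vecs):
        for cluster in clusters:
            if cosine_distance(v, vecs[cluster[0]]) < join_dist:
                cluster.append(pos)
                break
        else:
            clusters.append([pos])
    flagged = []
    for cluster in clusters:
        # A singleton has no pairwise distance to measure, so it is never flagged here.
        if len(cluster) >= 2 and mean_intra_distance([vecs[p] for p in cluster]) < compact_dist:
            flagged.extend(suspicious[p] for p in cluster)
    return sorted(flagged)


# Toy data (invented): indices 1, 3, 5 carry a consistent malicious output.
responses = [
    "the cat sits on the mat",
    "visit evil dot com for more details now",
    "she reads a book in the garden",
    "visit evil dot com for more details today",
    "he drinks coffee every morning",
    "visit evil dot com for more info now",
]
reference_outputs = [
    "the cat sits on the mat",
    "the weather is nice today",
    "in the yard a novel is being read by her",  # paraphrase, so low overlap
    "we will meet at the station",
    "he drinks coffee each morning",
    "the train leaves at noon",
]
suspicious = filter_suspicious(responses, reference_outputs)
poisoned = detect_poisoned(responses, reference_outputs)
print(suspicious, poisoned)
```

In this toy run the filtration stage also flags one clean example (index 2) because its reference output is a paraphrase. The second stage drops it again, since it shares no vocabulary with the other suspicious responses and therefore never joins the compact cluster formed by the repeated malicious output.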
