Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation

Yijun Pan; Taiwei Shi; Jieyu Zhao; Jiaqi W. Ma

TL;DR

Large language models (LLMs) are highly sensitive to unsafe training data, and the moderation classifiers commonly used to detect such data are expensive to train and limited to predefined taxonomies. This paper proposes Denoised Representation Attribution (DRA), a two-step method that denoises training and target representations through centering/whitening and dimension selection guided by leave-target-out discriminability, improving the detection of unsafe data and enabling effective filtering. Across jailbreaking-injection and gender-bias mitigation experiments, DRA consistently improves detection (up to 63.3% in AUPRC) and reduces unsafe behavior after retraining (up to 39.9% ASR reduction), outperforming state-of-the-art moderation-based approaches. Gradient-based attributions combined with DRA often yield the safest retrained models, underscoring the practical value of principled data attribution for safety. The paper also discusses limitations and risks, including potential adversarial manipulation, and identifies broader injection scenarios and taxonomy-aware detection as directions for future work.

Abstract

Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computational overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter make the overall representations "noisy" for the purpose of detecting unsafe training data. To address this challenge, we propose Denoised Representation Attribution (DRA), a novel representation-based data attribution approach that denoises training and target representations for unsafe data detection. Across tasks of filtering jailbreaks and detecting gender bias, the proposed approach leads to significant improvements for data attribution methods, outperforming SOTA methods that are mostly based on moderation classifiers.
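
The similarity-based attribution idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes representations (e.g., hidden states) have already been extracted as NumPy arrays, it centers and whitens with training-set statistics, it scores each training sample by its mean cosine similarity to the unsafe targets, and it omits the dimension-selection step, whose details are not given here. The function names `center_and_whiten` and `attribution_scores` are hypothetical.

```python
import numpy as np


def center_and_whiten(train_reps, target_reps, eps=1e-6):
    """Center and whiten both sets using training-set statistics (an assumption)."""
    mean = train_reps.mean(axis=0, keepdims=True)
    train_c = train_reps - mean
    target_c = target_reps - mean

    # Empirical covariance of the centered training representations.
    cov = train_c.T @ train_c / train_c.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values

    # Scale each eigenvector (column) by 1/sqrt(eigenvalue) so that the
    # transformed training data has approximately identity covariance.
    transform = eigvecs / np.sqrt(eigvals + eps)
    return train_c @ transform, target_c @ transform


def attribution_scores(train_reps, target_reps):
    """Score each training sample by mean cosine similarity to the targets."""
    train_w, target_w = center_and_whiten(train_reps, target_reps)
    train_n = train_w / (np.linalg.norm(train_w, axis=1, keepdims=True) + 1e-12)
    target_n = target_w / (np.linalg.norm(target_w, axis=1, keepdims=True) + 1e-12)
    # (n_train, n_target) similarity matrix, averaged over targets.
    return (train_n @ target_n.T).mean(axis=1)
```

Under these assumptions, training samples with the highest scores would be the candidates to flag and filter before retraining.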
