Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang

TL;DR

This work proposes token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model, which enables accurate identification and removal of unsafe tokens.

Abstract

Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at https://github.com/Polly-LYP/TOSS.
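
The scoring idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that). It assumes that per-token losses from both reference models are already available, that a token's risk score is the utility-oriented model's loss minus the safety-degraded model's loss (the sign convention is an assumption, since the abstract only says "loss difference"), and that tokens are ranked globally across the dataset with a fixed discard ratio. The function names `token_risk_scores` and `global_keep_mask` and the `discard_ratio` parameter are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def token_risk_scores(loss_degraded: Matrix, loss_utility: Matrix) -> Matrix:
    """Per-token risk score for each sample.

    Assumed sign convention: a token that the safety-degraded model
    predicts easily (low loss) but the utility-oriented model does not
    (high loss) receives a high risk score.
    """
    return [
        [u - d for d, u in zip(d_row, u_row)]
        for d_row, u_row in zip(loss_degraded, loss_utility)
    ]


def global_keep_mask(scores: Matrix, discard_ratio: float) -> List[List[bool]]:
    """Mark the highest-scoring tokens across ALL samples as discarded.

    Returns a mask with True for tokens kept in the fine-tuning loss
    and False for tokens removed from it.
    """
    indexed = [
        (score, i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
    ]
    # Rank every token in the dataset together, riskiest first.
    indexed.sort(key=lambda item: item[0], reverse=True)
    n_discard = int(len(indexed) * discard_ratio)

    mask = [[True] * len(row) for row in scores]
    for _, i, j in indexed[:n_discard]:
        mask[i][j] = False
    return mask


# Toy per-token losses for two samples (illustrative values only).
loss_degraded = [[0.2, 2.0, 1.5], [0.1, 0.3]]
loss_utility = [[2.5, 2.1, 1.4], [2.0, 0.4]]

scores = token_risk_scores(loss_degraded, loss_utility)
mask = global_keep_mask(scores, discard_ratio=0.4)
print(mask)  # [[False, True, True], [False, True]]
```

In a fine-tuning loop, a mask like this would typically be applied by excluding the discarded positions from the loss, so the remaining tokens of each sample still contribute task-specific signal.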

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 20 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Left: High-level comparison of sample-level and token-level selection methods for safe LLM customization. Right: Comparison of sample-level and token-level selection methods on different custom datasets with varying ratios of harmful data $r$. Our token-level data selection method achieves significant improvements in both safety and utility compared to the sample-level one.
  • Figure 2: Left: Per-token KL divergence difference ($\Delta \mathrm{KL}$) across token positions. As the difference increases, the customized model diverges from the safe base model and shifts towards the safety-degraded model. Right: Win rate comparison between the naive discarding method and the standard SFT method on safety (HEx-PHI) and utility (SLIMORCA) benchmarks, where the win rates are computed following the evaluation method described in the experiment section.
  • Figure 3: The overall pipeline of our token-level data selection method for safe LLM fine-tuning.
  • Figure 4: Left: Comparison of TOSS with its local ranking variant, showing that the global ranking strategy effectively improves both safety and utility. Middle: Comparison of TOSS with a sample-level variant, indicating that finer-grained token-level selection yields a better safety–utility trade-off. Right: Comparison of TOSS with two simplified variants, namely token-level selection guided solely by the safety-degraded model or solely by the utility-oriented model. The results highlight the complementary roles of the two models in discarding unsafe tokens while improving utility.
  • Figure 5: Comparison between the sample-level selection method and our token-level selection approach under varying discarding ratios on Llama-3-8B-Instruct. Across all discarding ratios, our method consistently achieves better trade-offs between safety and utility than the sample-level baseline.
  • ...and 1 more figure