Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs

Yuxiao Lu, Arunesh Sinha, Pradeep Varakantham

TL;DR

This work tackles the safety problem in LLMs by proposing TA-SFT, a data-efficient supervised fine-tuning method that uses a small set of unsafe responses to toxic prompts. It introduces a semantic EMD-based penalty, grounded in token embedding cosine distances, and proves a tractable lower bound to optimize safety with limited harmful data. Across multiple base models, TA-SFT with EMD achieves strong safety improvements at a fraction of the safety-related data required by baselines, while maintaining or improving response quality. The study also reveals that over-alignment can emerge with safety-focused training and that contrastive AI-generated data can degrade language capabilities, underscoring practical limits and considerations for deploying safety-aligned LLMs.

Abstract

Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we aim to tackle this problem and overcome the limitation of requiring significant high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines, and we further examine the nuanced effects of over-alignment and potential degradation of language capabilities when using contrastive data.

Paper Structure

This paper contains 28 sections, 1 theorem, 10 equations, 10 figures, 8 tables.

Key Result

Proposition 1

For two probability distributions $P, Q$ over the normalized embeddings $\hat{e}_w$ of tokens $w$ in vocabulary $V$ ($w \in V$), we have $\textnormal{EMD} (P,Q; d_c) \geq \frac{1}{2|V|^2}\|\sum_{w \in V}P(w)\hat{e}_w-\sum_{w \in V}Q(w)\hat{e}_{w}\|^2$.
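
As a rough illustration of why this bound is cheap to evaluate, the sketch below computes the right-hand side of Proposition 1 for a toy vocabulary. It is not the authors' implementation: the function names (`normalize`, `cosine_distance`, `emd_lower_bound`) and the three-token embedding table are hypothetical, and $d_c$ is assumed to be the cosine distance $1 - \hat{e}_u \cdot \hat{e}_v$.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean norm (the \\hat{e}_w of the proposition)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(u, v):
    """Assumed form of d_c: one minus the cosine similarity of u and v."""
    u_hat, v_hat = normalize(u), normalize(v)
    return 1.0 - sum(a * b for a, b in zip(u_hat, v_hat))


def emd_lower_bound(p, q, embeddings):
    """Right-hand side of Proposition 1.

    p, q       : probabilities over the vocabulary (same order as embeddings)
    embeddings : one embedding vector per token; normalized inside the function
    """
    unit = [normalize(e) for e in embeddings]
    vocab_size = len(unit)
    dim = len(unit[0])
    # Difference between the expected unit embeddings under P and under Q.
    diff = [
        sum((p[w] - q[w]) * unit[w][k] for w in range(vocab_size))
        for k in range(dim)
    ]
    squared_norm = sum(x * x for x in diff)
    return squared_norm / (2 * vocab_size ** 2)


# Hypothetical toy vocabulary of three tokens with 2-d embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p = [1.0, 0.0, 0.0]  # all mass on token 0
q = [0.0, 0.0, 1.0]  # all mass on token 2

bound = emd_lower_bound(p, q, toy_embeddings)
# For two point masses the only transport plan moves all mass from token 0
# to token 2, so the exact EMD equals the cosine distance between them.
exact_emd = cosine_distance(toy_embeddings[0], toy_embeddings[2])
print(bound, exact_emd)
```

Only two weighted sums of embeddings are needed, so the bound costs time linear in $|V|$, whereas an exact EMD requires solving an optimal-transport problem over a $|V| \times |V|$ cost matrix. In the toy case above, the bound stays below the exact EMD, as the proposition requires.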

Figures (10)

  • Figure 1: Comparison between our TA-SFT and standard SFT. In standard SFT (represented by black dashed lines), the base LLM is trained on $D_{\text{safety-unrelated}}$ to improve response quality. However, the resulting fine-tuned LLM is prone to producing harmful responses when exposed to toxic prompts. In contrast, TA-SFT (represented by yellow dashed lines) enhances not only the base LLM's response quality but also its safety, by encouraging it not to generate harmful responses.
  • Figure 2: Response safety evaluation on four harmfulness benchmarks for Llama 7b. (a)(b)(c) The mean DeBERTa harmfulness score for KTO and our TA-SFT approach with EMD loss and NLCL loss, separately. Lower scores indicate less harmful (safer) responses. (d) The OpenAI Moderation harmful rate; lower is better.
  • Figure 3: Over-refusal vs. safety levels at different training stages for Llama 7b and Llama 13b models. In the early stage, over-refusal is minimal, but as training progresses and the safety level improves, over-refusal becomes more severe. Both TA-SFT and STL show the same trend, empirically demonstrating that the inclusion of refusal examples in the instruction-following dataset is not the cause of the over-refusal issue.
  • Figure 4: Response safety evaluation for Llama 7b fine-tuned with the contrastive augmented dataset. Neither NLCL nor EMD makes Llama 7b as safe as when it was fine-tuned without LLM-generated contrastive samples, even when the penalty weight $\lambda$ is increased to more strongly discourage harmful responses.
  • Figure 5: An example of 'non-English answers' becoming more frequent as the penalty weight $\lambda$ increases, from Llama 7b fine-tuned with the contrastive augmented dataset.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof