Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen

TL;DR

This work formalizes the safety-capability trade-off in fine-tuning large language models by developing a theoretical framework for two safety-aware strategies: Alignment Loss Constraint (Case I) and Alignment Parameter Constraint (Case II). It derives explicit bounds on the safety alignment gap $G_s(P_\theta)$ and the capability gap $G_f(P_\theta)$, showing how proxy-safety data similarity, context overlap, and local parameter landscapes govern the trade-off; numerical experiments with Llama-2-7B on several alignment and fine-tuning datasets corroborate these insights. The results show that higher similarity between the original and proxy safety data mitigates safety degradation, that greater context overlap can worsen the trade-off, and that restricting parameter updates (Case II) can improve safety at the expense of capability gains. Together, the theoretical and empirical findings offer practical guidance for designing safer LLM fine-tuning protocols and for understanding the fundamental limits of safety when enhancing capabilities.

Abstract

Fine-tuning Large Language Models (LLMs) on task-specific datasets is a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are validated by numerical experiments.

Paper Structure

This paper contains 17 sections, 4 theorems, 27 equations, 3 figures.

Key Result

Theorem 1

Under the bounded log-probability and realizable-target assumptions, any solution $\theta$ of the safety-loss-penalty formulation satisfies the following safety alignment guarantee:

Figures (3)

  • Figure 1: Left: Alignment loss gap in Case I with varying penalty strength $\lambda$ and different proxy alignment datasets (indicated by the legend). Right: Safety-capability trade-off in Case I. The legend indicates [alignment dataset]-[fine-tuning dataset].
  • Figure 2: Left: Alignment loss gap in Case II with varying penalty strength $\lambda$ and different task datasets (indicated by the legend). Right: Safety-capability comparison of Case I and Case II.
  • Figure 3: Illustration of safety-capability trade-offs in LLM fine-tuning concerning context overlap. Here, context overlap refers to the overlap of input distributions between the proxy alignment data and the fine-tuning data in the case of alignment loss constraint (Case I).

Theorems & Definitions (4)

  • Theorem 1: Safety alignment loss gap in Case I
  • Theorem 2: Capability loss gap in Case I
  • Theorem 3: Safety alignment loss gap in Case II
  • Theorem 4: Capability loss gap in Case II
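
The theorems above bound a safety gap and a capability gap under a penalty of strength $\lambda$. As a rough intuition aid only, the toy sketch below replaces every loss with a quadratic so that the penalized minimizer has a closed form. It is not the paper's formulation, models, or data: the quadratic losses, the function names (`quad`, `penalized_solution`, `gaps`), and the vectors `task_opt`, `aligned`, and `proxy_opt` are all hypothetical assumptions chosen for illustration.

```python
import numpy as np

def quad(theta, center):
    """Toy quadratic loss ||theta - center||^2 (minimum value 0 at center)."""
    return float(np.sum((theta - center) ** 2))

def penalized_solution(task_opt, anchor, lam):
    """Closed-form minimizer of ||theta - task_opt||^2 + lam * ||theta - anchor||^2."""
    return (task_opt + lam * anchor) / (1.0 + lam)

# Hypothetical toy quantities (assumptions, not taken from the paper).
task_opt = np.array([1.0, 0.0])   # minimizer of the toy task (capability) loss
aligned = np.array([0.0, 0.0])    # minimizer of the toy "true" safety loss
proxy_opt = np.array([0.5, 0.0])  # minimizer of a mismatched toy proxy safety loss

def gaps(lam, anchor):
    """Return (safety gap, capability gap) of the penalized solution.

    Both toy losses have minimum 0, so each gap equals the loss value itself.
    """
    theta = penalized_solution(task_opt, anchor, lam)
    return quad(theta, aligned), quad(theta, task_opt)

if __name__ == "__main__":
    for lam in (0.1, 1.0, 10.0):
        print(f"lam={lam}: proxy anchor {gaps(lam, proxy_opt)}, "
              f"aligned anchor {gaps(lam, aligned)}")
```

In this toy, increasing `lam` with the exact anchor shrinks the safety gap while enlarging the capability gap, and a mismatched proxy anchor leaves a residual safety gap that no value of `lam` removes. These are properties of the quadratic stand-ins, not a reproduction of the paper's bounds or experiments.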