Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen
TL;DR
This work formalizes the safety-capability trade-off in fine-tuning large language models by developing a theoretical framework for two safety-aware strategies: Alignment Loss Constraint and Alignment Parameter Constraint. It derives explicit bounds on the safety alignment gap $G_s(P_\theta)$ and the capability gap $G_f(P_\theta)$, showing how the similarity between the original and proxy safety data, the context overlap, and the local parameter landscape govern the trade-off. Numerical experiments with Llama-2-7B on related datasets corroborate these insights. The results show that higher similarity between the original and proxy safety data mitigates safety degradation, that greater context overlap can worsen the trade-off, and that restricting parameter updates (the Alignment Parameter Constraint, Case II) can improve safety at the expense of capability gains. Together, the theoretical and empirical findings offer practical guidance for designing safer LLM fine-tuning protocols and clarify the fundamental limits on safety when enhancing capabilities.
Abstract
Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and the alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning and are validated by numerical experiments.
