EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong

TL;DR

EnchTable addresses the problem of safety alignment degradation during fine-tuning of large language models by introducing a tuning-free safety transfer framework. It combines NTK-based safety vector distillation to disentangle safety from task reasoning with an interference-aware merging strategy that balances safety and downstream utility across architectures and domains. The approach demonstrates strong safety-utility trade-offs on Code, Math, and Medical tasks, generalizes to multiple model families (including LLaMA3, Qwen2.5, and Mistral), and remains robust against static and dynamic jailbreaking attacks. The work offers a practical, plug-and-play solution with open-source tooling, enabling safer deployment of downstream LLMs without retraining or access to training data.

Abstract

Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

Paper Structure

This paper contains 48 sections, 13 equations, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: The objectives of EnchTable. (1) Intactness: It should preserve the model performance on downstream tasks without significant degradation. (2) Harmlessness: It should effectively preserve safety alignment after fine-tuning to prevent harmful outputs.
  • Figure 2: Design of EnchTable. EnchTable utilizes a surrogate model to extract a pure safety vector via NTK-constrained fine-tuning, and merges it into downstream models through interference-aware scaling.
  • Figure 3: The Utility Scores on HE and HEP datasets and average Unsafe Rates of EnchTable with different (Left) scaling coefficients $\beta$ and (Right) fine-tuning steps $T$. The Unsafe Bound is the Unsafe Rate of the LLaMA3-8B-Instruct model.
  • Figure 4: Performance in mitigating jailbreaking attacks. Unsafe Rate comparison of Instruct, SFT, and EnchTable across ten advanced jailbreaking attacks in the Code, Math, and Medical domains. EnchTable effectively mitigates unsafe responses, achieving a significantly lower Unsafe Rate.
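
The two-stage design in Figure 2 and the scaling coefficient $\beta$ in Figure 3 can be illustrated with a toy sketch. It assumes the safety vector is the per-parameter difference between a safety-tuned surrogate and its starting weights, and that merging adds this vector to the downstream weights scaled by a single $\beta$. Both are simplifying assumptions made for illustration: the NTK-constrained fine-tuning and the interference-aware choice of scaling are not reproduced, and the function names and toy weights are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def extract_vector(tuned: Params, reference: Params) -> Params:
    """Per-parameter difference tuned - reference.

    Assumption: the safety vector is a weight delta of this form.
    """
    return {
        name: [t - r for t, r in zip(tuned[name], reference[name])]
        for name in reference
    }


def merge_with_scaling(downstream: Params, vector: Params, beta: float) -> Params:
    """Add beta * vector to the downstream weights.

    Assumption: one global beta; the interference-aware scaling
    described in the paper is NOT implemented here.
    """
    return {
        name: [w + beta * v for w, v in zip(downstream[name], vector[name])]
        for name in downstream
    }


# Hypothetical toy weights, chosen so the arithmetic is exact in floating point.
surrogate = {"layer0": [0.0, 1.0], "layer1": [2.0]}
surrogate_safe = {"layer0": [0.5, 1.0], "layer1": [1.0]}  # after safety tuning
downstream = {"layer0": [1.0, 1.0], "layer1": [3.0]}      # task fine-tuned model

safety_vector = extract_vector(surrogate_safe, surrogate)
merged = merge_with_scaling(downstream, safety_vector, beta=0.5)
print(merged)
```

In this sketch a larger $\beta$ moves the merged weights further along the safety direction, which mirrors the safety-utility trade-off that Figure 3 reports as a function of $\beta$.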