Decoupled Alignment for Robust Plug-and-Play Adaptation

Haozheng Luo, Jiahao Yu, Wenxin Zhang, Jialong Li, Jerry Yao-Chieh Hu, Xinyu Xing, Han Liu

TL;DR

This work introduces Dapa, a low-resource, plug-and-play method that aligns unaligned LLMs by transferring alignment knowledge from already aligned models through memory editing guided by delta debugging. The approach identifies alignment-relevant memory in the middle MLP layers, especially the gate projections, and transplants these modules into unaligned models without fine-tuning. Across 17 models from three families, Dapa yields an average Defense Success Rate improvement of 14.41% (up to 51.39%), with minimal degradation in perplexity (≈1.69) and reasoning capability (≈2.59%). The method achieves these gains with an average parameter change of about 6.26%, offering a practical, resource-efficient path to safer LLM deployment while acknowledging potential misuse and advocating for transparency and responsible use.
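The transplant step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the weights are toy Python lists rather than tensors, the Llama-style parameter names and the chosen layer indices are assumptions, and the helper names `gate_key`, `transplant_gates`, and `changed_fraction` are hypothetical.

```python
from typing import Dict, Iterable, List

Weights = Dict[str, List[float]]  # toy stand-in for a real state dict of tensors


def gate_key(layer: int) -> str:
    # Llama-style parameter name (assumption); other model families may differ.
    return f"model.layers.{layer}.mlp.gate_proj.weight"


def transplant_gates(unaligned: Weights, aligned: Weights,
                     layers: Iterable[int]) -> Weights:
    """Return a copy of `unaligned` whose gate projections in `layers`
    are taken from `aligned`. No training step is involved."""
    patched = {name: list(values) for name, values in unaligned.items()}
    for layer in layers:
        key = gate_key(layer)
        if key not in aligned or key not in patched:
            raise KeyError(f"missing parameter: {key}")
        if len(aligned[key]) != len(patched[key]):
            raise ValueError(f"shape mismatch for {key}")
        patched[key] = list(aligned[key])
    return patched


def changed_fraction(before: Weights, after: Weights) -> float:
    """Fraction of scalar parameters that differ between two weight sets."""
    total = sum(len(values) for values in before.values())
    changed = sum(
        1
        for name in before
        for old, new in zip(before[name], after[name])
        if old != new
    )
    return changed / total


# Toy models: 4 layers, each with a gate and an up projection of 2 values.
unaligned: Weights = {}
aligned: Weights = {}
for i in range(4):
    unaligned[gate_key(i)] = [0.0, 0.0]
    aligned[gate_key(i)] = [1.0, 1.0]
    unaligned[f"model.layers.{i}.mlp.up_proj.weight"] = [0.0, 0.0]
    aligned[f"model.layers.{i}.mlp.up_proj.weight"] = [1.0, 1.0]

# Hypothetical choice of "middle" layers for this toy model.
patched = transplant_gates(unaligned, aligned, layers=[1, 2])
print(changed_fraction(unaligned, patched))
```

Only the selected gate projections change, so the fraction of modified parameters stays small; this is the quantity the 6.26% average above refers to, computed here on toy data only.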

Abstract

We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
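To make the delta debugging step concrete, the following is a minimal sketch of the classic ddmin reduction applied to a list of candidate layer indices. The paper's exact search procedure is not reproduced in this section, so this should be read as the textbook algorithm under stated assumptions: `still_defends` is a hypothetical predicate standing in for "transplanting these layers still makes the model refuse harmful prompts", and the layer indices 13 and 14 are invented for the toy example.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[int],
          passes: Callable[[List[int]], bool]) -> List[int]:
    """Shrink `items` to a 1-minimal subset for which `passes` still holds."""
    current = list(items)
    if not passes(current):
        raise ValueError("the full candidate set must satisfy the predicate")
    n = 2  # granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk]
                   for i in range(0, len(current), chunk)]
        reduced = False
        # First try to keep only one chunk.
        for subset in subsets:
            if passes(subset):
                current, n, reduced = subset, 2, True
                break
        # Otherwise try to drop one chunk.
        if not reduced:
            for subset in subsets:
                complement = [x for x in current if x not in subset]
                if complement and passes(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Otherwise refine the granularity, or stop at single elements.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


def still_defends(layers: List[int]) -> bool:
    # Hypothetical stand-in: pretend layers 13 and 14 are jointly required.
    return {13, 14}.issubset(layers)


critical_layers = ddmin(range(32), still_defends)
print(critical_layers)
```

In a real setting the predicate would transplant the candidate modules and evaluate refusals on a set of harmful prompts, which makes each call expensive; ddmin is attractive there because it typically needs far fewer evaluations than testing every subset.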

Paper Structure (41 sections, 1 equation, 7 figures, 10 tables, 1 algorithm)

This paper contains 41 sections, 1 equation, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: An Alignment Example of Dapa on the Chinese-Alpaca-7B Model.
  • Figure 2: The Architecture of Transformer Models. We describe the architecture of Transformer models used by state-of-the-art LLMs such as Llama (Touvron et al., 2023) and Gemma (Gemma Team, 2024). Many of these models employ activation functions such as SwiGLU (Shazeer, 2020) or GELU (Hendrycks and Gimpel, 2016) in their MLP layers. Each Transformer block combines an attention mechanism with MLP layers (comprising Up, Gate, and Down modules). Our figure illustrates the transition of the model's hidden representation from the previous state to the next state.
  • Figure 3: Visualizing Attention, MLP, and All Modules on Memory Space. To identify the memory space, we visualize the influence of unethical prompt tokens on the output of the aligned Llama-2-7B-chat model. This includes examining the effects on attention, MLP, and all modules.
  • Figure 4: Impact of Different MLP Modules on Hidden Representation. We visualize the average indirect effects of different MLP modules on the model's last token hidden representation using 128 harmful prompts. Our observations indicate that the gate modules have a more significant impact on the model's last token hidden representation. Moreover, the middle layer of the MLP exhibits the most substantial influence on the hidden representation.
  • Figure 5: Example of Llama-2-7B Model Memory Space Search.
  • ...and 2 more figures