Decoupled Alignment for Robust Plug-and-Play Adaptation
Haozheng Luo, Jiahao Yu, Wenxin Zhang, Jialong Li, Jerry Yao-Chieh Hu, Xinyu Xing, Han Liu
TL;DR
This work introduces Dapa, a low-resource, plug-and-play method that aligns unaligned LLMs by transferring alignment knowledge from already aligned models through memory editing guided by delta debugging. The approach identifies alignment-relevant memory in the middle MLP layers, especially the gate projections, and transplants these modules into unaligned models without fine-tuning. Across 17 models from three families, Dapa yields an average Defense Success Rate improvement of 14.41% (up to 51.39%), with minimal degradation in perplexity (≈1.69) and reasoning capability (≈2.59%). The method achieves these gains with an average parameter change of about 6.26%, offering a practical, resource-efficient path to safer LLM deployment while acknowledging potential misuse and advocating for transparency and responsible use.
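The transplant step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both models expose their weights as a name-to-tensor mapping, and that gate projections follow the Hugging Face LLaMA-style name `model.layers.<i>.mlp.gate_proj.weight`. The function name `transplant_gate_proj` and the choice of layer indices are hypothetical.

```python
import re

# Assumed parameter-name pattern (Hugging Face LLaMA-style); not taken from the paper.
GATE_PROJ = re.compile(r"^model\.layers\.(\d+)\.mlp\.gate_proj\.weight$")


def transplant_gate_proj(unaligned, aligned, layers):
    """Copy gate-projection weights for the chosen layers from `aligned`.

    `unaligned` and `aligned` are name -> weight mappings (for example,
    model state dicts). Returns a patched copy of `unaligned` and the
    list of parameter names that were replaced; the inputs are not mutated.
    """
    patched = dict(unaligned)  # shallow copy: untouched weights are shared
    replaced = []
    for name, weight in aligned.items():
        match = GATE_PROJ.match(name)
        if match and int(match.group(1)) in layers and name in patched:
            patched[name] = weight
            replaced.append(name)
    return patched, replaced


# Toy weights stand in for tensors so the sketch runs without a model.
unaligned = {f"model.layers.{i}.mlp.gate_proj.weight": f"u{i}" for i in range(4)}
unaligned["model.layers.1.mlp.up_proj.weight"] = "u_up"
aligned = {f"model.layers.{i}.mlp.gate_proj.weight": f"a{i}" for i in range(4)}
aligned["model.layers.1.mlp.up_proj.weight"] = "a_up"

patched, replaced = transplant_gate_proj(unaligned, aligned, layers={1, 2})
```

In this toy run only the gate projections of layers 1 and 2 come from the aligned model; every other entry, including the `up_proj` weight, keeps its unaligned value. With real models the two state dicts would need matching architectures and tensor shapes.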
Abstract
We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
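To make the delta debugging step concrete, here is a minimal sketch of the classic ddmin reduction applied to a list of candidate layer indices. It is an illustration under stated assumptions, not the paper's procedure: the predicate `still_aligned` is hypothetical and stands in for "transplanting only these layers still preserves the defense behaviour", and the layer set `{5, 9}` is made up for the toy example.

```python
def ddmin(items, test):
    """Shrink `items` to a 1-minimal sublist for which `test` stays True.

    `test(sublist)` must return True for the full list. The result is
    1-minimal: removing any single remaining element makes `test` False.
    """
    items = list(items)
    if not test(items):
        raise ValueError("test must hold on the full input")
    n = 2  # current number of chunks
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        starts = range(0, len(items), chunk)
        reduced = False
        # First try to keep a single chunk.
        for s in starts:
            subset = items[s:s + chunk]
            if test(subset):
                items, n, reduced = subset, 2, True
                break
        # Otherwise try to drop a single chunk.
        if not reduced:
            for s in starts:
                complement = items[:s] + items[s + chunk:]
                if test(complement):
                    items, n, reduced = complement, max(n - 1, 2), True
                    break
        # Otherwise refine the granularity, or stop at single elements.
        if not reduced:
            if n >= len(items):
                break
            n = min(len(items), n * 2)
    return items


# Hypothetical predicate: pretend layers 5 and 9 jointly carry the behaviour.
def still_aligned(layers):
    return {5, 9} <= set(layers)


critical_layers = ddmin(range(16), still_aligned)
```

In this toy setting the search narrows 16 candidate layers down to the two that the predicate actually depends on. In practice each call to the predicate would involve evaluating a patched model, so the number of calls, not the loop itself, dominates the cost.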
