Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning
Yanrui Du, Sendong Zhao, Jiawei Cao, Ming Ma, Danyang Zhao, Shuren Qi, Fenglei Fan, Ting Liu, Bing Qin
TL;DR
This work investigates security vulnerabilities introduced by instruction fine-tuning (IFT) of large language models and presents SWAT, a secure-tuning approach. SWAT identifies a robust subset of modules, Mods_Rob, via classifier-guided robustness analysis and trains in two phases: a warm-up on Mods_Rob to capture low-level features, followed by IFT on all parameters starting from the warmed-up model. This shifts the early learning burden away from fragile parameters. Across multiple datasets and LLMs, SWAT reduces security feature space drift, lowers Attack Success Rate and harmfulness scores, and preserves task performance, with additive benefits when combined with pre-training or post-training defenses. The method generalizes across models, improves results consistently in both Benign and Attack IFT scenarios, and offers a practical in-training defense that complements existing pre-training and post-training strategies.
Abstract
Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus mainly on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect the security feature space drift, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risks, followed by training all parameters to achieve optimal task performance. Essentially, this strategy shifts more of the early learning burden from the global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy mitigates security risks significantly while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, leading to greater improvements.
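The two-phase schedule described above can be sketched as a rule for which parameters are updated at each step. This is a minimal illustration, not the authors' implementation: the function names `select_trainable` and `swat_schedule`, the dotted parameter-name convention, and the choice of `q_proj` as the robust module are all assumptions made for the example. In the paper, Mods_Rob comes from the module-level robustness analysis, which is not reproduced here.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def select_trainable(param_names: Iterable[str],
                     robust_modules: Set[str],
                     phase: str) -> List[str]:
    """Return the parameter names that receive updates in the given phase.

    Assumes dotted names such as "layers.0.self_attn.q_proj.weight"
    (a hypothetical naming convention for this sketch).
    """
    names = list(param_names)
    if phase == "warmup":
        # Warm-up: only parameters belonging to a robust module are updated.
        return [n for n in names if robust_modules & set(n.split("."))]
    if phase == "full":
        # Second phase: every parameter is updated.
        return names
    raise ValueError(f"unknown phase: {phase!r}")


def swat_schedule(param_names: Iterable[str],
                  robust_modules: Set[str],
                  warmup_steps: int,
                  total_steps: int) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (step, phase, trainable parameter names) for each training step."""
    names = list(param_names)
    for step in range(total_steps):
        phase = "warmup" if step < warmup_steps else "full"
        yield step, phase, select_trainable(names, robust_modules, phase)


if __name__ == "__main__":
    params = [
        "layers.0.self_attn.q_proj.weight",
        "layers.0.self_attn.k_proj.weight",
        "layers.0.mlp.up_proj.weight",
    ]
    # "q_proj" is a placeholder for Mods_Rob, not a result from the paper.
    for step, phase, trainable in swat_schedule(params, {"q_proj"}, 2, 4):
        print(step, phase, len(trainable))
```

In a real training loop, the returned names would typically decide which parameters have gradients enabled in each phase; the number of warm-up steps is a hyperparameter that this sketch leaves open.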
