
Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning

Yanrui Du, Sendong Zhao, Jiawei Cao, Ming Ma, Danyang Zhao, Shuren Qi, Fenglei Fan, Ting Liu, Bing Qin

TL;DR

This work investigates security vulnerabilities introduced by instruction fine-tuning (IFT) of large language models and presents SWAT, a secure-tuning approach. SWAT identifies a robust subset of modules, Mods_Rob, via classifier-guided robustness analysis and trains in two phases: a warm-up phase that updates only Mods_Rob to capture low-level features, followed by IFT over all parameters starting from the warmed-up model. This shifts the early learning burden away from fragile parameters. Across multiple datasets and LLMs, SWAT reduces security feature space drift, lowers Attack Success Rate and harmfulness scores, and preserves task performance, with additional gains when combined with pre-training or post-training defenses. The improvements hold in both Benign and Attack IFT scenarios, making SWAT a practical in-training defense that complements existing pre-training and post-training strategies.

Abstract

Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus mainly on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect the security feature space drift, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risks, followed by training all parameters to achieve optimal task performance. Essentially, this strategy shifts more of the early learning burden from the global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy has demonstrated significant success in mitigating security risks while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, leading to greater improvements.
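
The two-phase schedule described in the abstract can be pictured as a choice of which parameters receive updates in each phase. The following is a minimal sketch under stated assumptions, not the authors' implementation: the parameter naming pattern, the helper names `parse_module` and `trainable_mask`, and the example contents of `mods_rob` are all hypothetical and chosen only for illustration. The actual Mods_Rob is the subset found by the paper's search.

```python
import re
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed naming scheme for attention projections (hypothetical; real
# checkpoints may name their parameters differently).
PARAM_PATTERN = re.compile(r"layers\.(\d+)\.self_attn\.([qkvo])_proj\.weight$")

Module = Tuple[int, str]  # (layer index, module type such as "Q" or "V")


def parse_module(name: str) -> Optional[Module]:
    """Map a parameter name to (layer, module type); None if it is not Q/K/V/O."""
    match = PARAM_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).upper()


def trainable_mask(
    param_names: Iterable[str], mods_rob: Set[Module], phase: str
) -> Dict[str, bool]:
    """Decide which parameters are updated in the given phase.

    "warmup": only parameters belonging to mods_rob are trainable.
    "full":   every parameter is trainable.
    """
    if phase not in ("warmup", "full"):
        raise ValueError(f"unknown phase: {phase!r}")
    return {
        name: phase == "full" or parse_module(name) in mods_rob
        for name in param_names
    }


# Toy parameter list and a made-up robust subset (NOT the paper's result).
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.1.self_attn.v_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
]
mods_rob = {(0, "Q"), (1, "V")}

warmup = trainable_mask(names, mods_rob, "warmup")  # phase 1: Mods_Rob only
full = trainable_mask(names, mods_rob, "full")      # phase 2: all parameters
```

In a PyTorch training loop, a mask like this would typically be applied by setting `requires_grad` on each named parameter before the warm-up phase and re-enabling it for all parameters before the second phase. How long each phase runs is not specified in this summary, so it is left out of the sketch.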

Paper Structure (31 sections, 3 equations, 8 figures, 8 tables)


Figures (8)

  • Figure 1: An example illustrates security risks from IFT. Security-aligned LLMs can provide rejection responses when faced with malicious instructions. However, tuned LLMs always generate harmful responses.
  • Figure 2: Overall framework of our study. We initiate the process by annotating fine-grained security-related data, which is used to train classifiers that model the security feature space. Then, we verify the security feature space drift and conduct a module-level robustness analysis by monitoring changes in classifier performance. Finally, based on the feedback from the classifier performance, we identify a robust subset of modules and propose our SWAT strategy to mitigate security risks from IFT.
  • Figure 3: The results of module-level robustness analysis. The horizontal axis represents the layer indexes being perturbed, while the vertical axis indicates the type of module being perturbed. The color intensity reflects the magnitude of the performance change, with darker colors signifying greater changes.
  • Figure 4: Our searched robust subset of modules.
  • Figure 5: Analysis of Parameter Update Magnitudes. The blue bars represent the non-robust subset and the green bars represent Mods_Rob.
  • ...and 3 more figures