Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens

TL;DR

This work introduces safety policy patching, a lightweight, drop-in prefix method that steers a deployed LLM toward a safer reference model without full retraining. The approach applies a two-stage learning pipeline to a compact 50-token prefix: supervised fine-tuning bootstraps a safety-aligned prefix, and direct preference optimization then refines its safety preferences. Across toxicity, gender bias, and harmfulness domains, patches achieve safety performance comparable to or exceeding next-generation models while incurring minimal parameter overhead and preserving fluency, often outperforming fixed prompts and competing methods like LoRA in deployment efficiency. The results demonstrate a practical, modular pathway for distributing scalable safety updates between major model releases. The paper also explores patch composition, initialization, and the safety-utility trade-off, which point to future refinements and broader applicability.
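The second stage named above, direct preference optimization (DPO), can be illustrated with the standard per-pair DPO objective. This is a minimal sketch rather than the authors' implementation: the function name `dpo_loss`, the value `beta=0.1`, and the choice of a frozen post-SFT model as the reference are assumptions made for illustration, and the inputs are summed log-probabilities of a preferred (safer) and a rejected response.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    All four inputs are sequence log-probabilities. `policy_*` would come
    from the patched model (only the prefix is trainable); `ref_*` from a
    frozen reference (assumed here to be the model after the SFT stage).
    """
    # How much more the policy favors the chosen response than the
    # reference does, relative to the rejected response.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities, for illustration only.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
safer_favored = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

When the policy and the reference agree, the loss equals log 2; it falls as the patched model assigns relatively more probability to the safer response.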

Abstract

We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal) policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
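The 0.003% figure can be sanity-checked with a short calculation. The sketch below assumes the patch is a matrix of 50 input-level embedding vectors (the 50-token length comes from the TL;DR) and uses illustrative values for an 8B-class model: a hidden size of 4096 and 8.0e9 base parameters. Those two numbers, and the helper name `prefix_overhead_percent`, are assumptions, not values stated in this section.

```python
def prefix_overhead_percent(prefix_tokens: int, hidden_size: int,
                            base_params: float) -> float:
    """Extra parameters from a soft prefix, as a percentage of the base model.

    Assumes one trainable vector of size `hidden_size` per prefix token.
    """
    extra_params = prefix_tokens * hidden_size
    return 100.0 * extra_params / base_params


# Assumed sizes for an 8B-class model (illustrative, not from the paper).
overhead = prefix_overhead_percent(prefix_tokens=50, hidden_size=4096,
                                   base_params=8.0e9)
print(f"{overhead:.5f}%")
```

Under these assumptions the prefix adds 204,800 parameters, which is the same order of magnitude as the overhead quoted in the abstract.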

Paper Structure

This paper contains 61 sections, 11 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: The problem setup, illustrating how a model vendor delivers a lightweight safety policy patch ($\mathbf{P}$) to a customer to fix a deficiency in a released model ($\mathcal{M}$), guided by the behavior of an unreleased, improved model ($\mathcal{M}'$).
  • Figure 2: Toxicity mitigation for $\mathcal{M}=\text{Llama3-8B}$. Additional results for $\text{Llama2-7B}$ and $\text{Aya23-8B}$ in Appendix \ref{sec:Tox}.
  • Figure 3: Bias mitigation for $\mathcal{M}=\text{Vicuna-13B}$. Additional results for $\text{Llama2-7B}$ and $\text{Vicuna-7B}$ in Appendix \ref{sec:bias}.
  • Figure 4: Harmful Mitigation Risk results for $\mathcal{M} = \text{Mistral-7b}$. Additional results for $\text{Gemma-9b}$ and $\text{Llama2-7b}$ in Appendix Figure \ref{fig:sup_risk3}. A tabular numerical comparison of this data is in Table \ref{tab:harm}.
  • Figure 5: LoRA vs. policy patch ($\mathcal{M}^{+}$).
  • ...and 8 more figures