Configurable Safety Tuning of Language Models with Synthetic Preference Data
Victor Gallego
TL;DR
Problem: runtime configurability of safety in LLMs is difficult; existing approaches fix safety with predefined rules or single-direction fine-tuning. Approach: Configurable Safety Tuning extends Direct Preference Optimization by conditioning preferences on a system prompt that encodes a safety configuration and by using synthetic preference data to support multiple settings at inference. Method: an augmented dataset consisting of $(s_0, x, y_0, y_1)$ and $(s_1, x, y_1, y_0)$ with the DPO objective $L_{DPO}(θ) = -\frac{1}{n} \sum_{i=1}^n \log \hat{p}_{\theta} (y^i_1 \succ y^i_0|x^i)$ where $\hat{p}_{\theta} (y_1 \succ y_0|x, s) = \sigma\left(\beta \log \frac{\pi_{\theta}(y_1|x)}{\pi_{ref}(y_1|x)} - \beta \log \frac{\pi_{\theta}(y_0|x)}{\pi_{ref}(y_0|x)}\right)$. Findings: CST enables configurable safety at inference while preserving general capabilities, demonstrated on two open-source LLMs and through multi-task prompts, with no extra synthetic preference data beyond existing pipelines. Significance: empowers deployers to tailor safety levels post-training, broadening safe deployment of open LLMs.
Abstract
State-of-the-art language model fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose a novel method, Configurable Safety Tuning (CST), that augments DPO using synthetic preference data to facilitate flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt specifying safety configurations, enabling LLM deployers to disable/enable safety preferences based on their need, just changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations and retains the original functionality of LLMs, showing it is a robust method for configurable deployment. Data and models available at https://github.com/vicgalle/configurable-safety-tuning
