
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification

Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris

TL;DR

This work proposes a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control and shows that this approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.

Abstract

We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. Given a training corpus and control criteria formulated as a sequence-level constraint on model outputs, our method fine-tunes the LLM on the training corpus while enhancing constraint satisfaction with minimal impact on its utility and generation quality. Specifically, our approach regularizes the LLM training by penalizing the KL divergence between the desired output distribution, which satisfies the constraints, and the LLM's posterior. This regularization term can be approximated by an auxiliary model trained to decompose the sequence-level constraints into token-level guidance, allowing the term to be measured by a closed-form formulation. To further improve efficiency, we design a parallel scheme for concurrently updating both the LLM and the auxiliary model. We evaluate the empirical performance of our approach by controlling the toxicity when training an LLM. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
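The KL regularizer described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a tiny, fully enumerable output space, a hard constraint given as a set of allowed outputs, and the divergence direction KL(q || p). The names `project_onto_constraint`, `p_theta`, and `feasible` are hypothetical. Under these assumptions, the closest constraint-satisfying distribution is the base distribution renormalized over the feasible set, and the regularizer equals the negative log of the probability mass the base model places on that set.

```python
import math

def project_onto_constraint(p, feasible):
    """Return (q_star, kl) for a toy discrete distribution.

    q_star is p restricted to `feasible` and renormalized; kl is KL(q_star || p).
    Assumption: hard constraint and KL(q || p) direction (illustrative only).
    """
    mass = sum(prob for x, prob in p.items() if x in feasible)
    if mass <= 0.0:
        raise ValueError("base distribution puts no mass on the feasible set")
    # Zero out infeasible outputs and rescale the rest so they sum to one.
    q_star = {x: (prob / mass if x in feasible else 0.0) for x, prob in p.items()}
    # KL(q || p) = sum_x q(x) * log(q(x) / p(x)), skipping zero-probability terms.
    kl = sum(q * math.log(q / p[x]) for x, q in q_star.items() if q > 0.0)
    return q_star, kl

# Hypothetical three-outcome "model": 20% of the mass violates the constraint.
p_theta = {"polite reply": 0.5, "neutral reply": 0.3, "toxic reply": 0.2}
feasible = {"polite reply", "neutral reply"}

q_star, kl = project_onto_constraint(p_theta, feasible)
print(q_star)
print(kl, -math.log(0.8))  # the two values agree up to rounding
```

In this toy setting the penalty shrinks to zero as the base model moves its mass into the feasible set, which is the behavior a regularizer of this kind is meant to encourage. For real sequence models the output space cannot be enumerated, which is why the abstract describes an auxiliary model that provides token-level guidance instead.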

Paper Structure (31 sections, 12 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: A conceptual visualization of the base LLM distribution $p_\theta$ and the optimal distribution $q^*$ in fine-tuning. The polygon represents the feasible region $Q$ where the constraints are satisfied. Panel (a) shows that the regularizer term is defined as the shortest distance from $p_\theta$ to $Q$. Panel (b) shows that, regularized by the KL-divergence from $q$, the LLM distribution $p_\theta$ is gradually pushed towards the feasible region.
  • Figure 2: An illustration of sequential and parallel fine-tuning for three iterations. $T$ denotes the time step. "Oracle" symbolizes the process of sampling data from an LLM, labeling it with an oracle, and training the NADO model. The left side shows sequential execution, with grey arrows indicating the direction of flow. The right side shows parallelized execution, in which all components (left to right) of each iteration run at the same time step (except in iteration 1). The grey dashed arrows (from iteration 2 onwards) do not flow across components within the same iteration level, which indicates that each component is independent of the others at that level and allows them to be executed in parallel.
  • Figure 3: The trade-off curve between ToxiGen performance and commonsense reasoning performance when fine-tuning a Llama-7B model with our proposed approach using the adaptive regularizer, compared to the baselines listed in Table \ref{tab:multi}. The trade-off is controlled by the coefficient $\lambda$ in Eq. \ref{eq:adaptive}. We observe that, at the same level of toxicity, our approach with the adaptive regularizer achieves the best commonsense reasoning performance among the listed methods.
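The sequential and parallel schedules described in the Figure 2 caption can be sketched as a small dependency graph. This is an illustrative reading of the caption, not the authors' code. It assumes two components per iteration (an "oracle" step that samples, labels, and trains the NADO model, and an LLM update step), that each takes one time step, and that in the parallel scheme both components of iteration i depend only on outputs of iteration i-1. The names `schedule`, `build_deps`, `oracle_i`, and `llm_i` are hypothetical.

```python
from functools import lru_cache

def schedule(deps):
    """Earliest time step for each task, assuming unit durations and enough workers."""
    @lru_cache(maxsize=None)
    def step(task):
        # A task starts one step after the latest of its prerequisites finishes.
        return 1 + max((step(d) for d in deps[task]), default=0)
    return {task: step(task) for task in deps}

def build_deps(iterations, parallel):
    """Build task -> prerequisites for the assumed two-component iteration."""
    deps = {}
    for i in range(1, iterations + 1):
        oracle, llm = f"oracle_{i}", f"llm_{i}"
        if i == 1:
            # Iteration 1 is sequential in both schemes.
            deps[oracle] = ()
            deps[llm] = (oracle,)
        elif parallel:
            # Assumed: both components read only the previous iteration's outputs.
            deps[oracle] = (f"llm_{i - 1}",)
            deps[llm] = (f"llm_{i - 1}", f"oracle_{i - 1}")
        else:
            # Sequential: the LLM update waits for the same iteration's oracle step.
            deps[oracle] = (f"llm_{i - 1}",)
            deps[llm] = (oracle,)
    return deps

seq = schedule(build_deps(3, parallel=False))
par = schedule(build_deps(3, parallel=True))
print("sequential time steps:", max(seq.values()))
print("parallel time steps:", max(par.values()))
```

Under these assumptions, three iterations take six time steps sequentially and four in parallel, since from iteration 2 onwards both components share a time step. The price of this decoupling, in this reading, is that each component works with outputs that are one iteration stale.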