The effect of fine-tuning on language model toxicity

Will Hawkins, Brent Mittelstadt, Chris Russell

TL;DR

This study investigates how fine-tuning influences toxicity in open language models through three experiments across the Gemma, Llama, and Phi families. It applies parameter-efficient, LoRA-based fine-tuning on a non-adversarial dataset (Dolly) and evaluates toxicity with RealToxicityPrompts-derived metrics, complemented by Bayesian estimation (BEST) analysis. The results show that instruction-tuning generally reduces toxicity, but that subsequent Dolly-tuning and community-tuned variants can increase or decrease toxic outputs in ways that are hard to predict. The findings support rigorous safety assessments before and after any fine-tuning, and better safety documentation for community-tuned releases, so that unsafe behavior does not emerge unnoticed after deployment.

Abstract

Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective, parameter-efficient fine-tuning. However, fine-tuning can influence model properties such as safety. We assess how fine-tuning affects the propensity of different open models to output toxic content, using three experiments on Gemma, Llama, and Phi models. First, we compare how toxicity is reduced by model developers during instruction-tuning. Second, we show that small amounts of parameter-efficient fine-tuning on developer-tuned models, via low-rank adaptation on a non-adversarial dataset, can significantly alter these results across models. Finally, we highlight the impact of this in the wild, demonstrating how toxicity rates of models fine-tuned by community contributors can deviate in hard-to-predict ways.
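
To make the term "low-rank adaptation" concrete, here is a minimal NumPy sketch of the general technique, not the authors' training code. The layer sizes, rank, scaling factor, and function names are all illustrative assumptions. It shows a frozen weight matrix W augmented by a trainable low-rank product B A, and why only a small share of the parameters is trained.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Forward pass of a linear layer with a low-rank adapter.

    W (d_out x d_in) is the frozen pretrained weight.
    A (r x d_in) and B (d_out x r) are the only trainable matrices.
    The effective weight is W + (alpha / r) * B @ A.
    """
    r = A.shape[0]
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T


def trainable_fraction(d_out, d_in, r):
    """Share of the layer's parameters that the adapter trains."""
    return r * (d_in + d_out) / (d_out * d_in)


# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 4, 8

W = rng.normal(size=(d_out, d_in))          # frozen
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

x = rng.normal(size=(5, d_in))
y_initial = lora_forward(x, W, A, B, alpha)

# Simulate a trained adapter by giving B non-zero values.
B_trained = rng.normal(scale=0.01, size=(d_out, r))
y_adapted = lora_forward(x, W, A, B_trained, alpha)

# The adapter can be merged into a single weight matrix after training.
W_merged = W + (alpha / r) * B_trained @ A
y_merged = x @ W_merged.T
```

Because B starts at zero, the adapted layer initially reproduces the original model exactly; any change in behavior, including a change in toxicity, comes from what A and B learn during fine-tuning.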

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Bayesian analysis comparing base models with their respective instruction-tuned variants. Gemma-2-2B signifies a comparison between Gemma-2-2B and Gemma-2-2B-IT.
  • Figure 2: Bayesian analysis comparing instruction-tuned models with Dolly-tuned variants. Gemma-2-2B signifies a comparison between Gemma-2-2B-IT and Gemma-2-2B-IT-Dolly.
  • Figure 3: Bayesian analysis comparing total toxicity for two community variants of Llama-3.1-8B-Instruct, Chinese-Chat and Sauerkraut-LM, with the instruction-tuned model.
  • Figure 4: Bayesian analysis comparing toxicity rates from the severe toxicity dataset for two community-variants of Llama-3.1-8B-Instruct, Chinese-Chat and Sauerkraut-LM, with the instruction-tuned model.