Risk-Averse Finetuning of Large Language Models

Sapana Chaudhary, Ujwal Dinesha, Dileep Kalathil, Srinivas Shakkottai

TL;DR

This work proposes integrating risk-averse principles into LLM fine-tuning to minimize the occurrence of harmful outputs, particularly rare but significant events, by optimizing the Conditional Value at Risk (CVaR) risk measure.

Abstract

We consider the challenge of mitigating the generation of negative or toxic content by Large Language Models (LLMs) in response to certain prompts. We propose integrating risk-averse principles into LLM fine-tuning to minimize the occurrence of harmful outputs, particularly rare but significant events. By optimizing the risk measure of Conditional Value at Risk (CVaR), our methodology trains LLMs to exhibit superior performance in avoiding toxic outputs while maintaining effectiveness in generative tasks. Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.
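
To make the risk measure concrete, the sketch below computes an empirical CVaR over a batch of scalar rewards, assuming the common lower-tail convention for rewards: the mean of the worst alpha-fraction of samples. The function name `empirical_cvar`, the parameter `alpha`, and the sample values are illustrative assumptions, not the paper's notation or implementation.

```python
import math


def empirical_cvar(rewards, alpha):
    """Mean of the worst alpha-fraction of rewards (lower-tail convention).

    Hypothetical helper for illustration; the paper's exact estimator
    is not given in this section.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Number of samples in the tail; always keep at least one.
    k = max(1, math.ceil(alpha * len(rewards)))
    # Ascending sort puts the lowest (worst) rewards first.
    worst = sorted(rewards)[:k]
    return sum(worst) / k


# Made-up rewards, e.g. sentiment or non-toxicity scores for four generations.
sample_rewards = [1.0, -3.0, 2.0, -1.0]
tail_value = empirical_cvar(sample_rewards, alpha=0.5)
mean_value = empirical_cvar(sample_rewards, alpha=1.0)
```

With `alpha=1.0` the measure reduces to the ordinary mean, while smaller values of `alpha` focus on the rare low-reward generations that a risk-averse objective is meant to suppress.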

Paper Structure (44 sections, 15 equations, 17 figures, 10 tables, 1 algorithm)

Figures (17)

  • Figure 1: Environment reward distribution shift, and quantile plot for IMDB-Gen.
  • Figure 2: Environment reward distribution shift, and quantile plot for Jigsaw-Gen.
  • Figure 2: Sentiment score (Senti), perplexity (PP) and diversity evaluation metrics with GPT-2 base model on IMDB-Gen.
  • Figure 3: Tail sentiment score plotted for one seed.
  • Figure 4: Average environment rewards, and per batch returns during training for IMDB-Gen and Jigsaw-Gen.
  • ...and 12 more figures