Risk-Averse Finetuning of Large Language Models

Sapana Chaudhary; Ujwal Dinesha; Dileep Kalathil; Srinivas Shakkottai

TL;DR

This work proposes integrating risk-averse principles into LLM fine-tuning to minimize harmful outputs, particularly rare but significant events, by optimizing the Conditional Value at Risk (CVaR) risk measure.

Abstract

We consider the challenge of mitigating the generation of negative or toxic content by Large Language Models (LLMs) in response to certain prompts. We propose integrating risk-averse principles into LLM fine-tuning to minimize the occurrence of harmful outputs, particularly rare but significant events. By optimizing the risk measure of Conditional Value at Risk (CVaR), our methodology trains LLMs to avoid toxic outputs more reliably while remaining effective in generative tasks. Empirical evaluations on sentiment modification and toxicity mitigation tasks demonstrate the efficacy of risk-averse reinforcement learning with human feedback (RLHF) in promoting a safer and more constructive online discourse environment.
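
To make the CVaR objective named in the abstract concrete, here is a minimal sketch of an empirical lower-tail CVaR over a batch of rewards. It is an illustration only, not the paper's training algorithm: the function name `empirical_cvar`, the tail level `alpha`, and the sample rewards are all hypothetical, and the paper's own definition may use different conventions.

```python
import math


def empirical_cvar(rewards, alpha):
    """Average of the worst `alpha` fraction of rewards (lower tail).

    Illustrative assumption: higher reward is better, so the "risky"
    tail is the set of lowest-reward samples.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    n = len(rewards)
    # Small tolerance guards against float error such as 0.3 * 10 > 3.
    k = max(1, math.ceil(alpha * n - 1e-9))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


# Hypothetical batch: mostly good generations, two rare bad ones.
rewards = [0.9, 0.8, -2.0, 0.7, 0.85, -1.0, 0.95, 0.6, 0.75, 0.9]
mean_reward = sum(rewards) / len(rewards)
tail_reward = empirical_cvar(rewards, alpha=0.2)
print(f"mean reward: {mean_reward:.3f}")
print(f"CVaR at alpha=0.2: {tail_reward:.3f}")
```

In this made-up batch the mean reward is positive while the tail average is strongly negative, which is the gap a risk-averse objective targets: it scores a policy by its worst outcomes rather than by its average.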

Paper Structure (44 sections, 15 equations, 17 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Alignment:
Preliminaries
Risk-Averse RLHF for LLM Fine-tuning
Experimental Evaluation
Baselines:
Tasks and Models:
Evaluation Metrics:
Results on Risk-Aversion
GPT-J:
Training Stability
RA-RLHF Hyperparameter Analysis
Conclusion
Acknowledgments
...and 29 more sections

Figures (17)

Figure 1: Environment reward distribution shift, and quantile plot for IMDB-Gen.
Figure 2: Environment reward distribution shift, and quantile plot for Jigsaw-Gen.
Figure 2: Sentiment score (Senti), perplexity (PP) and diversity evaluation metrics with GPT-2 base model on IMDB-Gen.
Figure 3: Tail sentiment score plotted for one seed.
Figure 4: Average environment rewards, and per batch returns during training for IMDB-Gen and Jigsaw-Gen.
...and 12 more figures
