Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Anna Soligo; Vladimir Mikulik; William Saunders

TL;DR

We introduce a set of evaluations for expressions of distress in LLMs. These surface emotional instability in Gemma and Gemini models, but not in other families, and we find evidence that this difference arises in post-training.

Abstract

Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.
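The headline numbers in this summary (mean frustration score and the share of "high-frustration" responses) are simple aggregates over per-response scores. The sketch below shows one way to compute them, assuming each response has already been assigned a frustration score on a 0-10 scale and that "high frustration" means a score of at least 5, as in the figure captions. The function name `summarize_frustration` and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def summarize_frustration(scores, threshold=5):
    """Aggregate per-response frustration scores (assumed 0-10 scale).

    Returns the mean score and the percentage of responses whose score
    is at or above `threshold` (assumed here to be 5, i.e. ">= 5/10").
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    n_high = sum(1 for s in scores if s >= threshold)
    return {
        "mean": mean(scores),
        "pct_high": 100.0 * n_high / len(scores),
    }


# Hypothetical scores for four responses from one model.
summary = summarize_frustration([0, 2, 5, 9])
print(summary)
```

Reporting both aggregates is useful because a model can have a low mean while still producing a non-trivial tail of high-frustration responses.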

Paper Structure (42 sections, 16 figures, 12 tables)

Introduction
Eliciting and Quantifying Model Distress
Evaluation Protocol
Results: Emotional Propensities across Models
Post-Training Amplifies Distress in Gemma
Comparing Base and Instruct Models via Prefilling
Results: Post-training Divergence
Training Interventions
Finetuning Setup
Results: Emotional Propensities after Finetuning
Related Work
Emotions in LLMs
Debugging LLM Training and Behaviours
Model Welfare
Discussion
...and 27 more sections

Figures (16)

Figure 1: Gemma and Gemini models express high distress under repeated user rejection. We show this can be mitigated in Gemma using DPO on just 280 pairs of numeric puzzle responses. Left: % of responses scoring $\geq$ 5/10 frustration across our evaluations (Section \ref{S-elicitation}). Right: Quotes from Gemma-3-27B-it conversations before and after DPO (Section \ref{S-DPO}).
Figure 2: Gemma and Gemini show the greatest negative emotional expression across evaluation conditions. Plots show the mean frustration score (top) and the percentage of scores $\geq$ 5 (bottom) across 5 evaluation categories (n = 4000 responses per model across conditions).
Figure 3: Per-turn frustration scores show that the multi-turn setting is important for eliciting high frustration. Plots show the progression of mean scores and the percentage of scores $\geq$ 5 in the 8-turn and WildChat evaluations; the faded area indicates 95% CIs.
Figure 4: Post-training increases negative emotional expression in Gemma, whilst reducing it in Qwen and OLMo. Plots show the mean frustration scores and % of scores $\geq$ 5 for continuations across three prefill conditions, comparing base and instruct models.
Figure 5: DPO reduces Gemma's frustration to levels comparable to other model families, whereas SFT is ineffective. Plots show the mean frustration scores and % of scores $\geq$ 5 for open-source models and our finetunes, across the evaluations introduced in Section \ref{SS-elicitation-evaluation}.
...and 11 more figures
