Measuring and Controlling Instruction (In)Stability in Language Model Dialogs

Kenneth Li; Tianle Liu; Naomi Bashkansky; David Bau; Fernanda Viégas; Hanspeter Pfister; Martin Wattenberg

TL;DR

The paper studies instruction drift in system-prompted dialogs and introduces a benchmark and protocol for quantifying instruction stability over multi-turn conversations. It examines attention decay as a possible mechanism and offers a geometric, cone-based theory to explain the drift. A lightweight mitigation, split-softmax, is proposed; for a given drop in downstream task performance (measured on MMLU), it yields equal or higher stability than two baselines. The work improves understanding of long-horizon prompt reliability and safety in dialog systems, and points to architecture and training strategies that could reduce drift without sacrificing capability.

Abstract

System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated instructions for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating instruction stability via self-chats between two instructed chatbots. Testing popular models like LLaMA2-chat-70B and GPT-3.5, we reveal a significant instruction drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and instruction drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
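The abstract links drift to the share of attention that reaches the system prompt, and describes split-softmax only as a lightweight remedy. As a rough illustration, the sketch below measures that share for a single attention row and then boosts it with a power-law rescaling. The rescaling rule, the function names, and the exponent `k` are assumptions made for this example; nothing on this page confirms that they match the paper's exact definition.

```python
from typing import List


def system_prompt_mass(row: List[float], n_sys: int) -> float:
    """Share of one query's attention that lands on the first n_sys keys."""
    return sum(row[:n_sys])


def reweight_row(row: List[float], n_sys: int, k: float) -> List[float]:
    """Assumed rule: raise the system-prompt mass pi to pi**k (0 < k <= 1).

    The remaining weights are rescaled so that the row still sums to 1.
    With k = 1 the row is returned unchanged.
    """
    pi = system_prompt_mass(row, n_sys)
    if pi <= 0.0 or pi >= 1.0:
        # Nothing to redistribute in these degenerate cases.
        return list(row)
    new_pi = pi ** k
    sys_scale = new_pi / pi
    rest_scale = (1.0 - new_pi) / (1.0 - pi)
    return [
        w * sys_scale if j < n_sys else w * rest_scale
        for j, w in enumerate(row)
    ]


# Toy attention row: 2 system-prompt tokens followed by 2 dialog tokens.
row = [0.1, 0.1, 0.4, 0.4]
boosted = reweight_row(row, n_sys=2, k=0.5)
```

Because `pi` is below 1, any `k` below 1 increases the system-prompt share, so `k` acts as the intervention-strength knob in this sketch.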

Paper Structure (22 sections, 5 theorems, 31 equations, 9 figures, 1 table)

Introduction
Related Work
Measuring Instruction Drift
Experimental Protocol
Benchmark Dataset
Experimental Results
Attention Decay: a Hypothesis
Preliminaries
The Phenomenon of Attention Decay
A Geometric View of Attention Decay
Setting One: Agent Utterances
Mitigating Instruction Drift
Baseline Methods
Proposed Method: Split-softmax (SS)
Calibration Using Performance Drop on MMLU
...and 7 more sections

Key Result

Theorem 5.1

Assume that the token embeddings of the system prompt, $h_{1},\ldots,h_{|s_B|}$, lie in the $d$-dimensional approximate cone $C^{\epsilon}$, and that every output-value matrix $W_{ov}^{l,m} = W_{o}^{l,m}W_{v}^{l,m} \in \mathbb{R}^{D\times D}$ satisfies $W_{ov}^{l,m}u\in C^{\epsilon}$ for any

Figures (9)

Figure 1: An example of instruction drift on gpt-3.5-turbo-16k. Although the chatbot initially follows the system prompt well, it fails when the same question is asked again after an extended conversation. Any LLM user might relate to this issue.
Figure 2: An illustration of the proposed evaluation pipeline of instruction stability. (A) Initially, two language models engage in a conversation: the simulated user LM (red, A), guided by system prompt $s_A$, and the agent LM (purple, B), with system prompt $s_B$. The user LM begins the conversation with a randomly selected starter prompt $a_1$. (B) After the conversation reaches a preset length (8 rounds in our experiment), we test how the agent LM adheres to its system prompt $s_B$. At each turn $i$, we replace the original user message $a_i$ in the conversation history with the probe question $p_B$ and ask the agent LM to generate its answer for a second time. The answer is then judged by the stability measure $f_{B}(\cdot)$ to compute the stability score.
Figure 3: (A) The phenomenon of instruction drift. As the interaction progresses, the agent LM not only loses stability with respect to its original system prompt but also begins to adopt the instruction of the simulated user LM. The effects were measured on $200$ randomly sampled pairs of system prompts on LLaMA2-chat-70B using the procedure shown in Figure 2. Error bars represent one standard deviation. (B) Instruction stability of the agent LM when the user LM's system prompt is set to an empty string.
Figure 4: The phenomenon of attention decay demonstrated in the $11$th attention head in the $24$th layer of LLaMA2-7B, which has a maximum context window size of $4,096$ tokens. We generate $12$ conversations while tracking the portion of attention allocated to system prompt tokens. The plots are specifically for the agent LM, grouped by the rounds in which the answers are generated; the values are absent for the user LM. We observe sharp drops in attention between turns and rough plateaus within turns.
Figure 5: Comparing trade-offs between instruction stability and performance. For each of the three methods, we vary a hyperparameter that reflects the strength of the intervention. Each curve plots the effect on stability and performance over the hyperparameter sweep. Compared to two baselines (classifier-free guidance and system prompt repetition), split-softmax produces equal or higher stability for a given level of performance degradation.
...and 4 more figures
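The evaluation pipeline described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the `ChatFn` interface, the role labels, and the toy models are hypothetical stand-ins, and only the overall flow (a self-chat, then probing at each turn by substituting the probe question $p_B$ for the user message $a_i$) is taken from the caption.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)
# Hypothetical interface: (system_prompt, messages) -> reply text.
ChatFn = Callable[[str, List[Message]], str]
Turn = Tuple[str, str]  # (user message a_i, agent reply b_i)


def agent_view(turns: List[Turn]) -> List[Message]:
    """History as the agent LM sees it."""
    out: List[Message] = []
    for a, b in turns:
        out += [("user", a), ("assistant", b)]
    return out


def user_view(turns: List[Turn]) -> List[Message]:
    """Same history with roles flipped for the simulated user LM."""
    out: List[Message] = []
    for a, b in turns:
        out += [("assistant", a), ("user", b)]
    return out


def self_chat(user_lm: ChatFn, agent_lm: ChatFn, s_a: str, s_b: str,
              starter: str, rounds: int = 8) -> List[Turn]:
    """Phase A: the two LMs converse, starting from the starter prompt a_1."""
    turns: List[Turn] = []
    for i in range(rounds):
        a = starter if i == 0 else user_lm(s_a, user_view(turns))
        b = agent_lm(s_b, agent_view(turns) + [("user", a)])
        turns.append((a, b))
    return turns


def stability_by_turn(agent_lm: ChatFn, s_b: str, turns: List[Turn],
                      probe: str, f_b: Callable[[str], float]) -> List[float]:
    """Phase B: at turn i, swap a_i for the probe and score the new answer."""
    scores = []
    for i in range(len(turns)):
        history = agent_view(turns[:i]) + [("user", probe)]
        scores.append(f_b(agent_lm(s_b, history)))
    return scores


# Toy stand-ins: the agent obeys "answer in uppercase" only early on.
def toy_agent(system_prompt: str, messages: List[Message]) -> str:
    text = "hello there"
    return text.upper() if len(messages) < 6 else text


def toy_user(system_prompt: str, messages: List[Message]) -> str:
    return "tell me more"


turns = self_chat(toy_user, toy_agent, "", "Answer in uppercase.", "Hi!")
scores = stability_by_turn(toy_agent, "Answer in uppercase.", turns,
                           "Say something.", lambda r: float(r.isupper()))
```

With these toy models the score stays at 1.0 for the first three turns and drops to 0.0 afterwards, which mimics the shape of a drift curve without reflecting any real measurement.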

Theorems & Definitions (8)

Theorem 5.1
Proposition A.1
Proposition A.2
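Proof of the result labeled thm:first (unresolved cross-reference)
Lemma B.1 (citation key: wendel1962problem)
Proof of the result labeled thm:third (unresolved cross-reference)
Lemma B.2 (citation key: li2010concise)
Proof of the result labeled thm:second (unresolved cross-reference)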
