Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Zachary Coalson; Bo Fang; Sanghyun Hong

TL;DR

This work presents a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task, and shows that existing defenses offer limited protection against this emerging class of failures.

Abstract

Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior (commonly encouraged in multi-turn conversation settings) to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.
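
As a rough illustration of what steering along an activation direction means in general, the sketch below adds a scaled unit direction to a single hidden-state vector and checks how far the projection onto that direction moves. This is a generic, minimal sketch and not the paper's optimization or intervention procedure: the function names, the toy vectors, and the strength `alpha` are all hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def projection(hidden, direction):
    """Scalar projection of `hidden` onto the unit-normalized `direction`."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    unit = normalize(direction)
    return sum(h * u for h, u in zip(hidden, unit))


def steer(hidden, direction, alpha):
    """Add `alpha` times the unit-normalized `direction` to `hidden`."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]


# Toy values (hypothetical): a 4-dimensional hidden state and a direction.
hidden = [0.5, -1.0, 2.0, 0.0]
direction = [1.0, 0.0, 0.0, 1.0]
steered = steer(hidden, direction, alpha=4.0)

# Steering moves the projection onto the direction by exactly alpha.
shift = projection(steered, direction) - projection(hidden, direction)
```

In a real model the same addition would be applied to the activations of a chosen layer at inference time; which layers, tokens, and strengths are used is a design choice that this sketch does not attempt to reproduce.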

Paper Structure (51 sections, 14 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Background and Related Work
Turn-Amplification in Conversational LLMs
Our Turn-Amplification Auditing Framework
Framework Construction
Evaluation Protocol
Metrics
Discovering Turn-Amplifying Directions
Synthetic Data Generation
Turn-Amplifying Direction Optimization
Exploitation via Activation Steering
Empirical Evaluation
Experimental Setup
Main Results
Mechanistic Analysis
...and 36 more sections

Figures (6)

Figure 1: Turn amplification as a conversational cost-amplification attack. (Left) Prior work amplifies cost by eliciting anomalously long single-turn responses. (Middle) Our turn-amplification attack prolongs interactions by inducing persistent clarification-seeking, while individual responses remain benign. (Right) Our approach, which discovers universal activation directions for turn amplification.
Figure 2: Overview of our framework for auditing turn-amplifying behaviors in large language models.
Figure 3: Impact of intervention layer on our method's effectiveness for Qwen2.5-3B. We optimize and steer at each layer in isolation; the baseline result in Table \ref{table:main-results} steers at all layers.
Figure 4: Activation of our features without steering for Llama3-8B at layer 10. Maximum cosine similarity between learned features and assistant token activations, collected from turn-prolonged inputs. Horizontal lines show the similarity in default (single-turn) conversations. Random baseline: 10 uniformly sampled directions.
Figure 5: Impact of key training and intervention configurations on turn-amplification effectiveness for Qwen2.5-3B on Alpaca (Easy mode). For each metric, we report the percentage change relative to the baseline steering results in Table \ref{table:main-results}.
...and 1 more figure
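
The statistic described in the Figure 4 caption, the maximum cosine similarity between a set of learned feature directions and an activation vector, compared against 10 random directions, can be sketched as follows. This is a minimal sketch under stated assumptions: the toy vectors are made up, the helper names are hypothetical, and the random baseline here draws each component uniformly from [-1, 1], which may differ from how the paper samples its directions.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def max_cosine(directions, activation):
    """Largest cosine similarity between `activation` and any direction."""
    return max(cosine(d, activation) for d in directions)


def random_directions(count, dim, rng):
    """Assumed baseline: components drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Toy data (hypothetical): two "learned" directions and one activation.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
activation = [0.9, 0.1, 0.0]

learned_score = max_cosine(features, activation)

# Baseline with 10 random directions, seeded for reproducibility.
rng = random.Random(0)
baseline_score = max_cosine(random_directions(10, 3, rng), activation)
```

Comparing `learned_score` with `baseline_score` across many activations gives the kind of contrast the caption describes; in practice the activations would come from a specific layer and token position of the model.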
