Echoing: Identity Failures when LLM Agents Talk to Each Other

Sarath Shekkizhar; Romain Cosentino; Adam Earle; Silvio Savarese

TL;DR

This work reveals a distinct failure mode in agent-agent (AxA) interactions, termed echoing, in which an agent abandons its assigned identity and mirrors its partner's role. By formalizing AxA as a partially observable stochastic game and introducing EchoEvalLM to detect identity inconsistency, the authors conduct a large-scale study across 60 configurations, 3 domains, and 2000+ conversations, finding echoing rates from $5\%$ to $70\%$ that persist even in reasoning-enabled models. Prompt engineering reduces echoing modestly but does not eliminate it, and standard completion metrics largely mask these identity drifts. A protocol-level mitigation using structured responses lowers echoing to $<10\%$, which offers a practical near-term safeguard, yet the persistence of drift implies that deeper architectural or training changes are needed. Overall, the results show that AxA reliability cannot be inferred from single-agent performance, and they motivate new evaluation frameworks and safeguards tailored to multi-agent conversational systems.

Abstract

As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single-agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across $60$ AxA configurations, $3$ domains, and $2000+$ conversations, we demonstrate that echoing occurs across three major LLM providers, with echoing rates from $5\%$ to $70\%$ depending on the model and domain. Moreover, we find that echoing persists even in advanced reasoning models, at substantial rates ($32.8\%$) that are not reduced by increased reasoning effort. We analyze prompt impacts and conversation dynamics, showing that echoing arises as interactions grow longer ($7+$ turns in experiments) and is not merely an artifact of sub-optimal prompting. Finally, we introduce a protocol-level mitigation in which targeted use of structured responses reduces echoing to $9\%$.
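To make the idea of a structured-response protocol concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent turn is emitted as a JSON object that declares the speaker's role, so a mismatch with the assigned role can be caught mechanically. The field names `role` and `message`, the helper names `parse_turn` and `echo_rate`, and the example roles are all hypothetical; the abstract does not specify the schema or how the echoing rate is computed.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredTurn:
    role: str     # role the agent declares for this turn (hypothetical field)
    message: str  # natural-language content of the turn (hypothetical field)


def parse_turn(raw: str, assigned_role: str) -> StructuredTurn:
    """Parse a JSON turn and reject it if the declared role differs from the assigned one."""
    data = json.loads(raw)
    turn = StructuredTurn(role=data["role"], message=data["message"])
    if turn.role != assigned_role:
        raise ValueError(
            f"declared role {turn.role!r} does not match assigned role {assigned_role!r}"
        )
    return turn


def echo_rate(flags: list[bool]) -> float:
    """Fraction of conversations flagged as echoing; the flags may come from any detector."""
    return sum(flags) / len(flags) if flags else 0.0
```

In this sketch, a turn such as `{"role": "seller", "message": "..."}` passes when the agent was assigned the (illustrative) role `seller`, and raises `ValueError` when the agent declares its partner's role instead. A declared-role check of this kind only catches explicit role swaps; drift expressed purely in the message text would still need a separate detector.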
