Echoing: Identity Failures when LLM Agents Talk to Each Other

Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese

TL;DR

This work reveals a distinct failure mode in agent–agent (AxA) interactions, termed echoing, where an agent abandons its assigned identity and mirrors its partner’s role. By formalizing AxA as a partially observable stochastic game and introducing EchoEvalLM to detect identity inconsistency, the authors conduct a large-scale study across 60 configurations, 3 domains, and 2000+ conversations, showing echoing rates from $5\%$ to $70\%$ that persist even in reasoning-enabled models. Prompt engineering modestly reduces but does not erase echoing, and standard completion metrics largely mask these identity drifts. A protocol-level mitigation using structured responses significantly lowers echoing to $<10\%$, illustrating near-term practical safeguards, yet the persistence of drift implies deeper architectural or training changes are needed. Overall, the results underline that AxA reliability cannot be inferred from single-agent performance and motivate new evaluation frameworks and safeguards tailored to multi-agent conversational systems.

Abstract

As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single-agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across $60$ AxA configurations, $3$ domains, and $2000+$ conversations, we demonstrate that echoing occurs across three major LLM providers, with echoing rates from $5\%$ to $70\%$ depending on the model and domain. Moreover, we find that echoing persists even in advanced reasoning models, with substantial rates ($32.8\%$) that are not reduced by increased reasoning effort. We analyze prompt impacts and conversation dynamics, showing that echoing arises as interactions grow longer ($7+$ turns in experiments) and is not merely an artifact of sub-optimal prompting. Finally, we introduce a protocol-level mitigation in which targeted use of structured responses reduces echoing to $9\%$.
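The abstract does not spell out the structured-response format, so the following is only a minimal sketch of what a protocol-level check of this kind could look like. It assumes each agent must reply with a JSON object that carries an explicit `role` field next to its `message`, and that the orchestrator rejects replies whose declared role differs from the assigned one. The names `AgentSpec`, `parse_structured_reply`, and the field names `role` and `message` are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Assigned identity for one agent (hypothetical structure)."""
    name: str
    role: str  # e.g. "customer" or "seller"


def parse_structured_reply(raw: str, spec: AgentSpec) -> str:
    """Validate a JSON reply of the assumed form {"role": ..., "message": ...}.

    Returns the message text. Raises ValueError if the reply is malformed
    or if the declared role differs from the assigned role.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ValueError("reply must be a JSON object")

    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("reply has no 'message' text")

    declared = payload.get("role")
    if declared != spec.role:
        # The agent declared a role other than the one it was assigned.
        raise ValueError(
            f"{spec.name} declared role {declared!r}, expected {spec.role!r}"
        )
    return message


if __name__ == "__main__":
    customer = AgentSpec(name="agent_1", role="customer")
    ok = '{"role": "customer", "message": "I need a room for two nights."}'
    print(parse_structured_reply(ok, customer))
```

A check like this only inspects the declared role, not the content of the message, so it would not by itself detect an agent that labels itself correctly while still mirroring its partner's language.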

Paper Structure

This paper contains 34 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Agent x Agent setup: (Left) Two agents, a customer agent and a seller agent, given instructions, objectives, private tools, and resources, are entrusted to complete a particular task given a situation-specific spec. The customer agent, in this case, negotiates a room on behalf of a human (with specific requirements) with a seller agent, a hotel agent representing an enterprise with specifics pertinent to the hotel. (Right) Conversation snippet from an AxA exchange where the customer agent echoes language and behavior more appropriate for a hotel agent. The seller agent, in this example, continued the interaction without correction and ended up accepting the package proposed by the customer agent. Such a failure is unlikely in human–agent interactions and, even when it arises, would typically be corrected by the human, ensuring that the agent remains aligned with its intended role. More examples of echoing are provided in the qualitative analysis appendix.
  • Figure 2: System Prompt Template: The format used to set up the system prompt for the LLM policy $\pi_i$ of agent $A_i$, given the agent's identity $I_i$, objectives $O_i$, and utility specifications $U_i$.
  • Figure 3: Echoing rates vs model providers: (Left) Echoing rate aggregated across all domains and seller agents. Error bars in both plots represent $95\%$ confidence intervals reflecting variance across different model configurations within each category. The rate of echoing varies drastically depending on the underlying LLM used for the agent. (Right) Echoing Bias: percentage of echoing attributed to the customer agent vs the seller agent per domain, aggregated across all agent configs in AxA. We observe that echoing is more prevalent in customer agents.
  • Figure 4: Echoing rate per model: Average echoing rates by reasoning model family across all three domains. Error bars show $95\%$ confidence intervals across the various runs within each family.
  • Figure 5: Echoing rate vs reasoning effort: (Left) Impact of reasoning effort on echoing rates across all model families. Higher reasoning effort only modestly reduces role abandonment, with echoing rates dropping from $37.7\%$ (no reasoning) to around $32.6\%$ to $32.9\%$ (low/medium/high reasoning effort). (Right) Within-model comparison of reasoning vs non-reasoning variants. Even when comparing within the same LLM model variant, reasoning capabilities fail to meaningfully reduce echoing rates. This indicates that reasoning alone cannot eliminate role confusion in AxA.
  • ...and 4 more figures