Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives

Romain Cosentino, Sarath Shekkizhar, Adam Earle

TL;DR

This work presents a mechanistic analysis of two LSA-based in-context optimizers that alternately update from each other under potentially misaligned objectives. It establishes a theoretical framework where asymptotic errors depend on the objective gap and the prompt-induced geometry of each agent, yielding biased plateaus under fixed objectives and monotone degradation with increased misalignment. It further shows that enabling turn-by-turn adaptation via a helper agent can realize Newton-like acceleration, turning a fragile interaction into cooperative optimization. The authors validate the theory with both trained LSA agents and GPT-5-mini experiments, and offer practical design principles for aligning objectives, controlling prompt geometry, and enabling adaptive collaboration to improve robustness in multi-agent LLM systems.

Abstract

We develop and analyze a theoretical framework for agent-to-agent interactions in a simplified in-context linear regression setting. In our model, each agent is instantiated as a single-layer transformer with linear self-attention (LSA) trained to implement gradient-descent-like updates on a quadratic regression objective from in-context examples. We then study the coupled dynamics when two such LSA agents alternately update from each other's outputs under potentially misaligned fixed objectives. Within this framework, we characterize the generation dynamics and show that misalignment leads to a biased equilibrium where neither agent reaches its target, with residual errors predictable from the objective gap and the prompt-induced geometry. We further contrast this fixed-objective regime with an adaptive multi-agent setting, wherein a helper agent updates a turn-based objective to implement a Newton-like step for the main agent, eliminating the plateau and accelerating its convergence. Experiments with trained LSA agents, as well as black-box GPT-5-mini runs on in-context linear regression tasks, are consistent with our theoretical predictions within this simplified setting. We view this framework as a mechanistic account that links prompt geometry and objective misalignment to stability, bias, and robustness, and as a stepping stone toward analyzing more realistic multi-agent LLM systems.
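
The biased equilibrium described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: it assumes each agent's gradient-descent-like update is an exact gradient step on a quadratic objective, starting from the other agent's latest output. The matrices `S_W` and `S_U` (standing in for prompt-induced geometry), the targets, the step size, and the function name `alternate` are all hypothetical choices made for illustration.

```python
import numpy as np

def alternate(S_W, S_U, w_star, u_star, eta=0.05, steps=2000):
    """Alternate gradient-style steps and return each agent's final error norm.

    Assumed update rule (an illustration, not taken from the paper's text):
      W starts from U's output and steps toward w_star under geometry S_W,
      then U starts from W's output and steps toward u_star under geometry S_U.
    """
    u = np.zeros_like(w_star)
    for _ in range(steps):
        w = u - eta * S_W @ (u - w_star)  # agent W's turn
        u = w - eta * S_U @ (w - u_star)  # agent U's turn
    return np.linalg.norm(w - w_star), np.linalg.norm(u - u_star)

# Hypothetical anisotropic geometries for the two agents.
S_W = np.diag([2.0, 0.5])
S_U = np.diag([0.5, 2.0])
w_star = np.array([1.0, 0.0])

# Aligned objectives: both agents share the same target.
aligned = alternate(S_W, S_U, w_star, u_star=w_star)

# Orthogonal objectives: the targets are 90 degrees apart.
orthogonal = alternate(S_W, S_U, w_star, u_star=np.array([0.0, 1.0]))

print("aligned    (err_W, err_U):", aligned)
print("orthogonal (err_W, err_U):", orthogonal)
```

Under these assumptions, the aligned run drives both errors to numerical zero, while the orthogonal run settles at a nonzero plateau for both agents, matching the qualitative picture of a biased equilibrium.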

Paper Structure

This paper contains 31 sections, 8 theorems, 85 equations, 4 figures, 2 algorithms.

Key Result

Proposition 1

Let $S:=S_W+S_U$ be invertible and let $\Delta=u^\star-w^\star$, then as $\eta \rightarrow 0$, (Proof in Appendix \ref{proof:first-direct})

Figures (4)

  • Figure 1: Plateau error vs. objective alignment: (left) With aligned objectives, both agents converge cooperatively toward the shared objective. Note that because of the $\sim 6^\circ$ angle between the objectives, the agents do not converge to $0$-error. (middle) With orthogonal objectives ($\sim 90^\circ$), convergence occurs toward a solution that does not advantage either agent. (right) With opposite ($\sim 174^\circ$) objectives, the dynamics are similar to the orthogonal case. Note that $(i)$ whether agent $U$ or agent $W$ converges to a better error is determined by the prompt geometry, and $(ii)$ in all cases here, neither agent converges to a $0$-error solution. These two key points are central to the characterization we provide in Section \ref{sec:axa-dynamic}.
  • Figure 2: Plateau error vs. objective angle - Plateau error of Agents $W$ (blue) and $U$ (orange) as a function of the objective alignment angle ($1000$ LSA agent-to-agent interactions). We display the theoretical lower and upper bounds from Corollary \ref{cor:angle-only} for each agent. As these bounds characterize, larger alignment angles correspond to higher plateau errors.
  • Figure 3: Cooperative Agents - We compare the convergence of a single agent $W$ (blue) to the same agent interacting with a cooperative helper $U$ for only $3$ alternating steps (orange). The helper's objective is updated dynamically at each turn using only turn-local quantities $(u_t, w_{t+1}, S_W, S_U, \eta)$. Following the analytic construction in Corollary \ref{cor:newton-realized-main}, the helper computes a temporary target $u_t^\star = w_{t+1} - [I + (\eta S_U)^{-1}(I - \eta S_U)] z_{t+1}$. This LSA-agent experiment shows that a helper agent can improve another agent's convergence rate by shaping turn-based objectives.
  • Figure 4: White-box agent-to-agent attack. We evaluate the adversarial algorithm proposed in Algorithm \ref{alg:adv-Xgamma-line} from Section \ref{sec:ker-formulation} under three objective-gap settings: orthogonal, scaled, and opposite (e.g., opposite is defined as $u^\star = - w^\star$). Each panel plots the mean trajectory across $100$ runs with shaded $\pm$ std bands (learning rate $\eta\!=\!0.005$). Left: distance of the victim (Agent $W$) to its target $w^\star$ over interaction steps. In all conditions, $W$ converges to a nonzero plateau whose level depends on the gap geometry, as predicted by Proposition \ref{prop:first-direct} and the angle bounds in Corollary \ref{cor:angle-only}. Right: distance of the attacker (Agent $U$) to $u^\star$. Consistent with the kernel criterion $(I-\eta S_U)S_W\Delta=0$, $U$ rapidly drives its error to (near-)zero, yielding one-sided success. Top: GPT-5-mini agents; early-step variability reflects model decoding noise but does not alter the outcome. Bottom: LSA-trained agents, same protocol. Overall, both agent types match the theory: anisotropy plus misalignment induces a predictable bias for $W$, while the adversarial spike in $S_U$ yields fast convergence for $U$.
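
The kernel criterion $(I-\eta S_U)S_W\Delta=0$ quoted in the Figure 4 caption can be checked numerically in a toy setting. The sketch below is not Algorithm \ref{alg:adv-Xgamma-line}. It assumes the same exact-gradient-step alternation as an idealization of the LSA agents, and it builds a hypothetical $S_U$ with a single large eigenvalue $1/\eta$ along the direction $S_W\Delta$, which is one simple way to satisfy the criterion. All numerical values and variable names are illustrative assumptions.

```python
import numpy as np

eta = 0.05                       # hypothetical step size
S_W = np.diag([2.0, 0.5])        # hypothetical geometry of the victim W
w_star = np.array([1.0, 0.0])    # victim target
u_star = np.array([0.0, 1.0])    # attacker target
delta = u_star - w_star          # objective gap, Delta = u* - w*

# Unit direction of S_W @ Delta, where the attacker places its spike.
d = S_W @ delta
d = d / np.linalg.norm(d)
P = np.outer(d, d)               # projector onto that direction

# Assumed attacker geometry: eigenvalue 1/eta along d, eigenvalue a elsewhere.
a = 1.0
S_U = a * (np.eye(2) - P) + (1.0 / eta) * P

# Residual of the kernel criterion; zero up to floating-point error.
residual = np.linalg.norm((np.eye(2) - eta * S_U) @ S_W @ delta)

# Alternate the assumed gradient-style updates: W's turn, then U's turn.
u = np.zeros(2)
for _ in range(2000):
    w = u - eta * S_W @ (u - w_star)
    u = w - eta * S_U @ (w - u_star)

err_W = np.linalg.norm(w - w_star)
err_U = np.linalg.norm(u - u_star)
print("criterion residual:", residual)
print("err_W:", err_W, "err_U:", err_U)
```

In this toy run the attacker's error vanishes while the victim stays at a nonzero plateau, which is the one-sided outcome the caption describes.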

Theorems & Definitions (16)

  • Proposition 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Proposition 2
  • Corollary 4
  • Corollary 5
  • Lemma 1
  • proof
  • proof
  • ...and 6 more