Chaotic Dynamics in Multi-LLM Deliberation

Hajime Shimao, Warut Khern-am-nuai, Sung Joo Kim

TL;DR

The paper quantifies inter-run sensitivity in multi-LLM committees using an empirical Lyapunov exponent derived from trajectory divergence in committee mean preferences. Its results support stability auditing as a core design requirement for multi-LLM governance systems.

Abstract

Collective AI systems increasingly rely on multi-LLM deliberation, but their stability under repeated execution remains poorly characterized. We model five-agent LLM committees as random dynamical systems and quantify inter-run sensitivity using an empirical Lyapunov exponent ($\hat{\lambda}$) derived from trajectory divergence in committee mean preferences. Across 12 policy scenarios, a factorial design at $T=0$ identifies two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Critically, these effects appear even in the $T=0$ regime where practitioners often expect deterministic behavior. In the HL-01 benchmark, both routes produce elevated divergence ($\hat{\lambda}=0.0541$ and $0.0947$, respectively), while homogeneous no-role committees also remain in a positive-divergence regime ($\hat{\lambda}=0.0221$). The combined mixed+roles condition is less unstable than mixed+no-role ($\hat{\lambda}=0.0519$ vs $0.0947$), showing non-additive interaction. Mechanistically, Chair-role ablation reduces $\hat{\lambda}$ most strongly, and targeted protocol variants that shorten memory windows further attenuate divergence. These results support stability auditing as a core design requirement for multi-LLM governance systems.
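As a rough illustration of how an empirical Lyapunov exponent might be read off repeated runs, the sketch below computes a mean pairwise trajectory distance $D(t)$ over committee mean preferences and fits the slope of $\log D(t)$ against the round index. The abstract does not spell out the estimator, so the absolute-difference distance, the log-linear fit, and the function names `mean_pairwise_distance` and `empirical_lyapunov` are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def mean_pairwise_distance(trajs):
    """D(t): mean absolute gap between all pairs of runs at each round.

    trajs has shape (n_runs, n_rounds); each row is one run's committee
    mean preference over deliberation rounds (hypothetical data layout).
    """
    trajs = np.asarray(trajs, dtype=float)
    i, j = np.triu_indices(trajs.shape[0], k=1)  # all unordered run pairs
    return np.abs(trajs[i] - trajs[j]).mean(axis=0)

def empirical_lyapunov(trajs, eps=1e-12):
    """Slope of log D(t) versus t (assumed estimator, not from the paper)."""
    d = mean_pairwise_distance(trajs)
    t = np.arange(d.size)
    # eps guards against log(0) when runs coincide exactly at some round
    slope, _intercept = np.polyfit(t, np.log(d + eps), 1)
    return slope

# Synthetic check: gaps between runs grow like exp(0.05 * t)
t = np.arange(10)
runs = np.array([0.5 + r * 1e-3 * np.exp(0.05 * t) for r in range(4)])
lam_hat = empirical_lyapunov(runs)
```

Under this reading, a positive slope means nominally identical runs drift apart over rounds, and a slope near zero or below means they stay together or reconverge.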

Paper Structure

This paper contains 5 sections, 1 equation, and 4 figures.

Figures (4)

  • Figure 1: Two design routes to instability and their interaction (HL-01, $T=0$). (A) Mean pairwise trajectory distance $D(t)$ for three conditions: uniform NoRoles (positive but lower-divergence baseline), uniform Roles (Route A), and mixed NoRoles (Route B). Both Route A and Route B further amplify divergence. (B) 2$\times$2 matrix crossing roles and model composition. The mixed+roles cell is less unstable than mixed+no-role, showing non-additive interaction. Realized sample sizes are shown in-cell (uniform+roles has slight attrition at $n=19$; all other cells are $n=20$).
  • Figure 2: Chair mechanism in HL-01 and cross-scenario heterogeneity. (A) HL-01 role-ablation effects on $\Delta\hat{\lambda}=\hat{\lambda}_{\text{full roles}}-\hat{\lambda}_{\text{ablated}}$. Chair ablation yields the largest reduction among role mandates (with a no-role reference bar shown for context). (B) Scenario-level Chair effect across IM-01, HL-01, CL-01, SP-03, and AI-01, reported as $\Delta\hat{\lambda}=\hat{\lambda}_{\text{full roles}}-\hat{\lambda}_{\text{ablate Chair}}$. Point estimates are positive in all five scenarios, with strongest and best-resolved effects in IM-01 and HL-01.
  • Figure 3: Protocol intervention and falsification tests via memory-depth control. (A) Intervention: reducing memory window from $k=15$ to $k=3$ attenuates divergence across four scenarios. (B) Falsification target: collapsing memory to $k=1$ lowers $\hat{\lambda}$ in IM-01 and CL-01 relative to baseline Roles. Together these tests support a feedback-memory amplification pathway.
  • Figure 4: Full cross-scenario 2$\times$2 instability landscape at $T=0$. Heatmap of $\hat{\lambda}$ across 12 scenarios for uniform/mixed $\times$ no-role/roles conditions. This panel provides the completed benchmark matrix for the two-route claim. Mixed-model cells are all estimated at $n=20$; uniform cells have realized $n$ displayed in-cell where slight attrition occurred.
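The non-additive interaction in the $2\times 2$ design can be made concrete with the four HL-01 values reported in the abstract. The sketch below compares the observed mixed+roles value with what a purely additive combination of the two routes would predict. The additive baseline and the variable names are illustrative assumptions. Only the four $\hat{\lambda}$ values come from the source.

```python
# HL-01 values of the empirical Lyapunov exponent, as reported in the abstract
lam = {
    ("uniform", "no_roles"): 0.0221,
    ("uniform", "roles"): 0.0541,   # Route A: role differentiation
    ("mixed", "no_roles"): 0.0947,  # Route B: model heterogeneity
    ("mixed", "roles"): 0.0519,
}

base = lam[("uniform", "no_roles")]
route_a = lam[("uniform", "roles")] - base    # effect of adding roles alone
route_b = lam[("mixed", "no_roles")] - base   # effect of mixing models alone

# If the two routes simply added up, mixed+roles would sit here
additive_prediction = base + route_a + route_b

# Interaction contrast: observed minus additive prediction
interaction = lam[("mixed", "roles")] - additive_prediction

print(f"Route A effect:      {route_a:+.4f}")
print(f"Route B effect:      {route_b:+.4f}")
print(f"Additive prediction: {additive_prediction:.4f}")
print(f"Interaction:         {interaction:+.4f}")
```

The contrast is negative: the observed mixed+roles value is well below the additive prediction and also below mixed+no-role, which is the pattern the captions describe as non-additive.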