Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

Dipkumar Patel

Abstract

Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 points of per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust findings are that collapse is measurable, that it worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.
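
The two collapse metrics in the abstract are easy to reproduce from a stack of rationale embeddings. The sketch below is a minimal, hypothetical implementation: the paper does not specify its exact estimators, so we assume row-normalized embeddings and the standard exponential-of-entropy definition of effective rank (which equals the number of agents for mutually orthogonal embeddings and 1.0 under total collapse); the function names are ours.

```python
# Minimal sketch of the collapse metrics, under the assumptions stated above.
import numpy as np

def mean_pairwise_cosine(E: np.ndarray) -> float:
    """Mean cosine similarity over all agent pairs.

    E: (n_agents, dim) matrix of rationale embeddings.
    """
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    S = E @ E.T                                       # pairwise cosine matrix
    iu = np.triu_indices(len(E), k=1)                 # upper triangle, no diagonal
    return float(S[iu].mean())

def effective_rank(E: np.ndarray) -> float:
    """exp(entropy of normalized singular values): n_agents for orthogonal
    embeddings, 1.0 under total collapse (assumed estimator, not confirmed
    by the paper)."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

# Toy example: three nearly parallel embeddings (the collapse regime).
rng = np.random.default_rng(0)
base = rng.normal(size=768)
E = np.stack([base + 0.1 * rng.normal(size=768) for _ in range(3)])
print(mean_pairwise_cosine(E))  # close to 1.0 -> collapsed
print(effective_rank(E))        # well below 3.0
```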

Paper Structure

This paper contains 16 sections, 2 figures, 7 tables, and 1 algorithm.

Figures (2)

  • Figure 1: DALC protocol. Three role-conditioned agents independently generate chain-of-thought rationales (Phase 1). Embeddings reveal representational collapse: cosine similarity 0.88--0.91 before projection. Optional Gram-Schmidt orthogonalization decorrelates embeddings (Phase 2). Agents re-answer with truncated hints from others, and diversity-weighted voting produces the final answer (Phase 3); a sketch of Phases 2-3 follows this list.
  • Figure 2: Accuracy vs. mean tokens per question across benchmarks and model scales (Qwen2.5). Marker shape encodes method (circle = DALC, square = SC, triangle = Single); fill color encodes benchmark. DALC variants achieve comparable or higher accuracy than SC at 25--34% lower token cost across all conditions.
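
As a reading aid for Figure 1, here is a hypothetical sketch of Phases 2-3. The paper does not publish its weighting formula; we assume a simple geometry-based rule in which each agent's vote is discounted by its mean cosine similarity to the other agents, and the optional Phase-2 step follows textbook Gram-Schmidt. The names gram_schmidt, diversity_weights, and weighted_vote are ours, not from the paper.

```python
# Hypothetical sketch of DALC Phases 2-3 (assumptions noted above).
from collections import defaultdict
import numpy as np

def gram_schmidt(E: np.ndarray) -> np.ndarray:
    """Optional Phase 2: orthogonalize embeddings to decorrelate them."""
    Q = []
    for v in E:
        for q in Q:
            v = v - (v @ q) * q          # remove components along earlier agents
        norm = np.linalg.norm(v)
        if norm > 1e-10:                 # drop (near-)dependent directions
            Q.append(v / norm)
    return np.stack(Q)

def diversity_weights(E: np.ndarray) -> np.ndarray:
    """Assumed rule: weight each agent by 1 - mean cosine to the others."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = En @ En.T
    n = len(E)
    mean_sim = (S.sum(axis=1) - 1.0) / (n - 1)   # exclude self-similarity
    w = np.maximum(1.0 - mean_sim, 1e-6)         # guard against total collapse
    return w / w.sum()

def weighted_vote(answers: list[str], w: np.ndarray) -> str:
    """Phase 3: diversity-weighted majority vote over agent answers."""
    tally: dict[str, float] = defaultdict(float)
    for a, wi in zip(answers, w):
        tally[a] += float(wi)
    return max(tally, key=tally.get)

# Example: weights come from the raw embedding geometry; Gram-Schmidt is a
# separate optional decorrelation step, per Figure 1.
# final = weighted_vote(["42", "42", "40"], diversity_weights(E))
```

Under this assumed rule, near-duplicate rationales split what is effectively a single vote, while a genuinely dissenting agent retains close to full weight.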