ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces

Ramchand Kumaresan

TL;DR

This work documents which assumptions fail in practice, reports negative results, and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.

Abstract

We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. The system is implemented on top of TEAMLLM, a deterministic execution substrate with immutable artifacts and complete decision traces. We evaluate ACAR on 1,510 tasks spanning four benchmarks: MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing more than 7,550 auditable runs. Results show that sigma-based routing achieves 55.6 percent accuracy, exceeding the two-model baseline of 54.4 percent while avoiding full ensembling on 54.2 percent of tasks. The routing mechanism is model-agnostic and requires no learned components. We also document negative results. First, retrieval augmentation reduced accuracy by 3.4 percentage points, as median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces noise rather than grounding. Second, when models agree on incorrect answers (sigma equals zero), no downstream ensemble can recover; this agreement-but-wrong failure mode is intrinsic to self-consistency and bounds achievable accuracy at approximately eight percentage points below full ensembling. Third, attribution estimates based on proxy signals such as response similarity and entropy showed weak correlation with ground-truth leave-one-out values, indicating that practical attribution requires explicit counterfactual computation. This work documents which assumptions fail in practice and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.

Paper Structure (49 sections, 2 equations, 9 figures, 2 tables, 1 algorithm)

Figures (9)

  • Figure 1: Distribution of $\sigma$ across 1,510 tasks. Task difficulty is bimodal: 32.9% of tasks show full agreement ($\sigma$=0.0), while 45.8% show complete disagreement ($\sigma$=1.0). This bimodality enables effective routing: easy tasks avoid expensive ensembling.
  • Figure 2: Overall pass rate comparison. ACAR-U (55.6%) exceeds Arena-2 (54.4%) with adaptive compute allocation. Arena-3 (63.6%) represents the quality ceiling.
  • Figure 3: Pass rate by benchmark. Performance varies by domain: SuperGPQA shows the highest accuracy (60.5%), while MathArena remains challenging (26.7%).
  • Figure 4: Cost vs. accuracy Pareto frontier. ACAR-U achieves better accuracy than Arena-2 at lower cost.
  • Figure 5: Escalation distribution by benchmark. $\sigma$-routing adapts to task difficulty: SuperGPQA routes 42% of tasks to single-agent, while MathArena and LiveCodeBench escalate 93% and 96% of tasks, respectively, to full-arena.
  • ...and 4 more figures
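
The percentages in the captions above can be cross-checked against the abstract. The short calculation below assumes (this is not stated in the captions) that exactly the tasks with $\sigma$=1.0 are escalated to full three-model ensembling, and that Arena-3 is the full-ensembling ceiling referred to in the abstract. Under those assumptions, the 54.2 percent of tasks that avoid full ensembling and the gap of roughly eight percentage points both follow from the reported numbers.

```latex
\begin{align*}
\text{tasks avoiding full ensembling} &= 100\% - 45.8\% = 54.2\% \\
\text{tasks with intermediate } \sigma &= 100\% - 32.9\% - 45.8\% = 21.3\% \\
\text{gap between ACAR-U and Arena-3} &= 63.6\% - 55.6\% = 8.0 \text{ percentage points}
\end{align*}
```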

Theorems & Definitions (2)

  • Definition 1: Self-Consistency Variance
  • Definition 2: Execution Mode
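
The bodies of these definitions are not reproduced on this page, so the following is only a minimal sketch of how sigma-based routing over N=3 probe samples might look. The normalization of sigma as (distinct answers - 1) / (N - 1), the routing thresholds, and the names `Mode`, `self_consistency_sigma`, and `route` are all assumptions made for illustration. They are chosen so that sigma is 0.0 under full agreement and 1.0 under complete disagreement, as in Figure 1, and may differ from the paper's Definitions 1 and 2.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical names for the three execution modes described in the abstract."""
    SINGLE_MODEL = 1
    TWO_MODEL = 2
    THREE_MODEL = 3


def self_consistency_sigma(answers):
    """Assumed sigma: (number of distinct answers - 1) / (N - 1).

    Returns 0.0 when all probe samples agree and 1.0 when all differ.
    """
    if len(answers) < 2:
        raise ValueError("need at least two probe samples")
    # Light normalization so trivially different strings count as the same answer.
    distinct = {a.strip().lower() for a in answers}
    return (len(distinct) - 1) / (len(answers) - 1)


def route(sigma):
    """Assumed thresholds: agree -> single, partial -> two-model, disagree -> three-model."""
    if sigma == 0.0:
        return Mode.SINGLE_MODEL
    if sigma < 1.0:
        return Mode.TWO_MODEL
    return Mode.THREE_MODEL


if __name__ == "__main__":
    for probes in (["42", "42", "42"], ["42", "41", "42"], ["a", "b", "c"]):
        sigma = self_consistency_sigma(probes)
        print(probes, sigma, route(sigma).name)
```

A rule of this shape needs no learned components, which matches the abstract's description. It also makes the agreement-but-wrong failure mode visible: three identical but incorrect probe answers yield sigma of 0.0 and are never escalated.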