ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference

Siyuan Ma, Bo Gao, Xiaojun Jia, Simeng Qin, Tianlin Li, Ke Ma, Xiaoshuang Jia, Wenqi Ren, Yang Liu

Abstract

The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimator grounded in amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. We further introduce a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy) as a principled alternative to ad hoc voting over heterogeneous candidates. Extensive evaluation across 23 benchmarks shows strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. We also validate reproducibility on a fully open-source stack (Llama 4 + DeepSeek), where ODAR surpasses homogeneous sampling strategies while reducing computational costs by 82%. Overall, our results suggest that thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute.
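To make the fusion idea concrete, here is a minimal sketch of risk-sensitive answer selection. The abstract does not give the exact objective, so everything here is an assumption: the score is taken to be the mean token surprisal (negative log-likelihood) plus a weighted variance of token surprisal as a sample-based stand-in for varentropy. The names `free_energy_score`, `fuse`, and `risk_weight` are hypothetical.

```python
from statistics import fmean, pvariance


def free_energy_score(token_logprobs, risk_weight=1.0):
    """Illustrative score (assumed form, not the paper's objective).

    Lower is better: mean surprisal penalizes unlikely answers, and the
    variance of surprisal (a sample proxy for varentropy) penalizes
    answers whose confidence fluctuates across tokens.
    """
    surprisals = [-lp for lp in token_logprobs]
    mean_surprisal = fmean(surprisals)
    surprisal_variance = pvariance(surprisals)
    return mean_surprisal + risk_weight * surprisal_variance


def fuse(candidates, risk_weight=1.0):
    """Return the answer whose token log-probs give the lowest score.

    `candidates` maps an answer string to its list of token log-probs.
    """
    return min(
        candidates,
        key=lambda answer: free_energy_score(candidates[answer], risk_weight),
    )


if __name__ == "__main__":
    demo = {
        "steady": [-0.1, -0.1, -0.1],   # uniformly confident
        "spiky": [-0.05, -0.05, -2.0],  # one very uncertain token
    }
    print(fuse(demo))
```

Unlike majority voting, a score of this shape can prefer a single well-supported candidate over several low-confidence ones, which is the behavior the abstract attributes to free-energy-based fusion.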

Paper Structure (105 sections, 27 equations, 12 figures, 35 tables, 2 algorithms)

Figures (12)

  • Figure 1: Multi-dimensional performance comparison across 23 benchmarks. The radar chart visualizes ODAR's average accuracy across 8 task categories relative to frontier models (GPT-5.1, Claude-4.5) and the strongest inference-time baseline (Self-Consistency). ODAR significantly expands the performance envelope, particularly in Mathematics (+20.2% on IMO 2025) and Advanced Cognition (+12.0% on HLE), while maintaining superior cost-efficiency.
  • Figure 2: ODAR System Architecture (Overview). Given input $x$, a rule-based dispatch layer (ER/MR/SS; fixed heuristics and priority rules) extracts coarse task features and selects a base model identifier for dispatch. In parallel, the Difficulty Estimator predicts $d\in[0,1]$ and the Strategy Selector routes the query using fixed thresholds $\tau_1{=}0.3$ and $\tau_2{=}0.7$: Simple ($d{<}\tau_1$): Fast-only ($c{=}1$); Medium ($\tau_1{\leq}d{<}\tau_2$): Fast + Slow verification ($c{=}2$); Hard ($d{\geq}\tau_2$): Best-of-$N$ with $N{=}5$ Slow candidates plus FEP fusion ($c{=}6$). Our evaluation emphasizes difficulty-based routing and FEP fusion; ER/MR provide system-level dispatch/model selection.
  • Figure 3: Multi-dimensional Ablation Matrix. Columns denote accuracy drop ($\Delta$ Acc %) relative to Full ODAR across 16 representative benchmarks. The red line indicates normalized inference cost. Results confirm that the DE prevents cost explosions while the Slow Agent ensures reasoning depth.
  • Figure 4: Mapping from theoretical principles to computational implementation. Each theoretical framework contributes a distinct component: FEP enables principled fusion, Active Inference guides adaptive routing, and theta-gamma coupling motivates the dual-agent architecture.
  • Figure 5: Theoretical framework overview. Left panel: neuroscientific principles including gamma oscillations, theta rhythms, and the Free Energy Principle. Middle panel: corresponding computational principles of fast pattern matching, slow verification, and uncertainty-driven resource allocation. Right panel: ODAR implementation comprising the Fast Agent, Slow Agent, and Difficulty Estimator.
  • ...and 7 more figures
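
The threshold routing described in the Figure 2 caption can be sketched as follows. The thresholds $\tau_1 = 0.3$, $\tau_2 = 0.7$, the three tiers, and the cost values $c \in \{1, 2, 6\}$ come from the caption; the `Route` type, the `select_strategy` function, and the strategy labels are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

TAU_1 = 0.3  # Simple/Medium boundary, from the Figure 2 caption
TAU_2 = 0.7  # Medium/Hard boundary, from the Figure 2 caption


@dataclass(frozen=True)
class Route:
    tier: str      # "simple", "medium", or "hard"
    strategy: str  # illustrative label, not an identifier from the paper
    cost: int      # the caption's c value for this tier


def select_strategy(d: float) -> Route:
    """Map a predicted difficulty d in [0, 1] to a routing decision."""
    if not 0.0 <= d <= 1.0:
        raise ValueError(f"difficulty must lie in [0, 1], got {d}")
    if d < TAU_1:
        return Route("simple", "fast_only", 1)
    if d < TAU_2:
        return Route("medium", "fast_plus_slow_verification", 2)
    # Hard tier: Best-of-N with N = 5 Slow candidates plus FEP fusion (c = 6).
    return Route("hard", "best_of_5_slow_plus_fep_fusion", 6)


if __name__ == "__main__":
    for d in (0.1, 0.3, 0.69, 0.7, 1.0):
        print(d, select_strategy(d))
```

The boundaries follow the caption's inequalities: a query with $d = \tau_1$ is Medium and one with $d = \tau_2$ is Hard.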