Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Natapong Nitarach

Abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO 3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
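The abstract's framing can be made concrete with the Binomial voting model used in Figure 1: if the $N$ attempts were independent, each correct with per-attempt accuracy $\hat{p}$, the chance that a strict majority is correct follows directly from the Binomial distribution. The sketch below assumes this independent strict-majority simplification (real majority voting picks the modal free-form answer, so this is illustrative only); the value $\hat{p}=0.69$ is the paper's figure for gpt-oss-120b.

```python
from math import comb

def majority_vote_score(p: float, n: int) -> float:
    """P(strict majority of n i.i.d. attempts is correct), each attempt
    correct with probability p. Sums the upper Binomial tail k > n/2."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Expected score at the paper's per-attempt accuracy for gpt-oss-120b
for n in (8, 16, 32):
    print(n, round(majority_vote_score(0.69, n), 3))
```

Under this model the score grows toward 1 as $N$ increases whenever $p > 0.5$, which is why a 17-point gap in $\hat{p}$ compounds rather than washes out at larger $N$.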

Paper Structure

This paper contains 38 sections, 3 equations, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Model capability dominates. Per-attempt accuracy $\hat{p}$ vs. expected majority-vote score under Binomial voting at $N{=}8,16,32$. Four models at their empirical $(\hat{p},\,\text{score})$. The 17-point gap between gpt-oss-120b ($\hat{p}{=}0.69$) and all alternatives dwarfs every prompt optimization tested ($\pm$2 points). Inference-time tricks cannot bridge a capability gap.
  • Figure 2: Prompt diversity vs. score on gpt-oss-120b. Blue circles: individual baseline runs ($N{=}13$). Black diamonds: configuration means. Shaded band: baseline $\pm 1\sigma$. More diversity monotonically degrades performance.
  • Figure 3: Per-problem $\hat{\rho}$ vs. $\hat{p}$ across three models (10 problems each). Circles = Qwen3.5-35B-A3B ($N{=}16$); squares = gpt-oss-120b ($N{=}8$); hollow triangles = Nemotron-Super-120B ($N{=}3$, hollow because $\hat{\rho}{=}-0.500$ is mathematically forced for any $v_c{=}1$, $N{=}3$ outcome). Teal/blue/purple = correct final answer; red/orange = wrong. All 8 computable points show $\hat{\rho}<0$. Orange dotted line: mean $\hat{\rho}=-0.258$ (all 8); grey dotted: mean $=-0.113$ ($N{\geq}7$ only). Consensus problems ($\hat{p}{\approx}1$) are shown in the shaded region. The independence assumption ($\rho{=}0$) is approximately valid; the true $\rho$ is slightly negative, leaving nothing for diversity strategies to exploit.
  • Figure 4: Qwen3.5-35B-A3B ablation on 10 local problems. Blue bars: baseline ($8/10$). Orange bars: underperform ($7/10$). Red labels: crashed configs. Nothing improves beyond baseline.
  • Figure 5: Complete ablation across all experiments. Blue bar: baseline (39.7). Orange bars: gpt-oss-120b interventions. Yellow-orange bars: diversity mixer variants. Teal: E11 N-ablation ($N{=}3$). Red: Nemotron-Super (cross-model). Purple: Qwen3.5-35B (cross-model). Green dashed line: baseline mean. No experiment reliably exceeds baseline.
  • ...and 2 more figures
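Figure 3's per-problem $\hat{\rho}$ admits a simple plug-in form. The estimator below is an assumption on our part, not quoted from the paper, but it is consistent with the caption's forced value $\hat{\rho}=-0.500$ at $v_c{=}1$, $N{=}3$: treat the $N$ attempts as exchangeable Bernoulli trials, estimate $p$ by $\hat{p}=v_c/N$, estimate $P(\text{both correct})$ for a random pair by the observed fraction of correct-correct pairs, and form $\hat{\rho} = (\hat{P}_{\text{pair}} - \hat{p}^2)/(\hat{p}(1-\hat{p}))$.

```python
from math import comb

def rho_hat(v_c: int, n: int) -> float:
    """Plug-in pairwise correlation of binary correctness among n
    exchangeable attempts, v_c of which are correct.

    Assumed estimator (matches the caption's forced rho_hat = -0.5
    at v_c = 1, n = 3). Undefined on consensus problems, where
    p_hat is 0 or 1 and the denominator vanishes.
    """
    p = v_c / n
    if p in (0.0, 1.0):
        raise ValueError("rho undefined at consensus (p_hat = 0 or 1)")
    pair_both_correct = comb(v_c, 2) / comb(n, 2)  # observed C-C pair rate
    return (pair_both_correct - p * p) / (p * (1 - p))
```

For example, $v_c{=}1$, $N{=}3$ gives $\hat{\rho} = (0 - 1/9)/(2/9) = -0.5$ regardless of which attempt was correct, which is why those points are drawn hollow in Figure 3.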