CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

Siye Wu, Jian Xie, Yikai Zhang, Yanghua Xiao

TL;DR

This paper proposes CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes adaptive reasoning by allocating tokens via a policy-internal difficulty signal, without requiring external annotations or user-provided budgets.

Abstract

The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, these models often fall into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this formulation, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
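
The reward shaping described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `gated_rewards`, the difficulty estimate (one minus the group solve rate), the linear gate shapes around a `pivot`, the hard-side strength `beta`, the length normalization by `max_len`, and the choice to shape only correct rollouts are all assumptions made for illustration. Only the overall structure (a group-based difficulty estimate, two non-negative gates, a length-dependent term added to a binary base reward, and an easy-penalty strength `alpha`) comes from the text.

```python
from typing import List


def gated_rewards(
    correct: List[bool],
    lengths: List[int],
    max_len: int,
    alpha: float = 0.5,   # easy-penalty strength (named in the paper's Figure 4)
    beta: float = 0.5,    # hard-side bonus strength (assumed name and default)
    pivot: float = 0.5,   # assumed difficulty level separating easy from hard
) -> List[float]:
    """Shape binary rewards for one group of rollouts on a single prompt."""
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # Assumed difficulty signal: fraction of rollouts in the group that fail.
    difficulty = 1.0 - sum(correct) / len(correct)

    # Two non-negative gates; at most one is active for a given difficulty.
    easy_gate = max(0.0, pivot - difficulty)
    hard_gate = max(0.0, difficulty - pivot)

    rewards = []
    for ok, n_tokens in zip(correct, lengths):
        base = 1.0 if ok else 0.0
        rel_len = min(n_tokens, max_len) / max_len  # length scaled to [0, 1]
        # Assumption: only correct rollouts receive the length-dependent term.
        shaping = (beta * hard_gate - alpha * easy_gate) * rel_len if ok else 0.0
        rewards.append(base + shaping)
    return rewards


# Easy prompt: every rollout is correct, so longer answers are penalized.
easy = gated_rewards([True, True], [100, 400], max_len=1000)

# Hard prompt: most rollouts fail, so a long correct answer earns a bonus.
hard = gated_rewards([True, False, False, False], [800, 900, 700, 600], max_len=1000)
```

Under these assumptions, the shorter of two correct rollouts receives the higher reward on an easy prompt, while a correct rollout on a hard prompt is rewarded above the base value in proportion to its length.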

Paper Structure

This paper contains 49 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Adaptive compute allocation across difficulty levels on Qwen3-8B-Base. (a) Compared to GRPO, CODA dynamically allocates reasoning tokens by question difficulty, consuming substantially fewer tokens on easier problems while increasing compute for harder ones. (b) On easy tasks (GSM8K, MATH), extra tokens yield marginal gains, and CODA achieves optimal accuracy at minimal cost by avoiding unnecessary reasoning. On hard tasks (AIME24 and AIME25), additional tokens substantially improve performance, and CODA encourages deeper reasoning to maximize accuracy.
  • Figure 2: Dynamics of difficulty-gated weights under different training difficulty distributions, showing that CODA automatically adapts its compute allocation to the observed difficulty.
  • Figure 3: Robust performance under different training difficulty distributions. CODA remains effective across difficulty shifts, maintaining competitive accuracy while adjusting costs.
  • Figure 4: Training dynamics under different easy-penalty strengths $\alpha$. Moderate $\alpha$ effectively suppresses unnecessary reasoning while preserving base reward. However, excessively large $\alpha$ over-penalizes long responses, limiting length scaling and leading to observable gaps in base reward.
  • Figure 5: AIME25 evaluation behavior ($\mathrm{mean@32}$) when assigning the length-dependent bonus to correct vs. incorrect responses.
  • ...and 1 more figure