SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu

TL;DR

The paper tackles the bottleneck in test-time compute scaling for mathematical reasoning caused by uniform resource allocation across sub-problems. It introduces SCALE, a cognitively inspired framework that decomposes a problem into sub-problems, scores each sub-problem's difficulty, assigns System 1 or System 2 processing accordingly, and executes the sub-problems sequentially while preserving context, thereby concentrating resources on the challenging steps. Empirically, SCALE improves accuracy by up to 13.75 percentage points on AIME25 and uses 33-53% fewer tokens across multiple models. It also extends to non-reasoning models by generating high-quality synthetic traces for supervised fine-tuning, achieving large cross-architecture improvements. The approach validates the selective-allocation hypothesis, preserves inference-time scaling laws, and offers a practical, model-agnostic path to more efficient mathematical reasoning at inference time.

Abstract

Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.
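
The four stages can be pictured as a simple control loop. The Python sketch below is illustrative only and is not the authors' implementation: the function names (`run_selective_pipeline`, `select_mode`), the callable parameters standing in for model calls, the 0-to-1 difficulty scale, and the default threshold of 0.5 are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    sub_problem: str
    difficulty: float
    mode: str  # "system1" (fast) or "system2" (deliberate)
    answer: str


def select_mode(difficulty: float, threshold: float) -> str:
    """Stage 3: route by comparing the score to a threshold.

    Assumption: higher scores mean harder sub-problems, and scores at or
    above the threshold go to deliberate (System 2) processing.
    """
    return "system2" if difficulty >= threshold else "system1"


def run_selective_pipeline(
    problem: str,
    decompose: Callable[[str], List[str]],    # hypothetical: problem -> sub-problems
    score: Callable[[str, str], float],       # hypothetical: (sub-problem, context) -> difficulty
    fast_solve: Callable[[str, str], str],    # hypothetical System 1 solver
    slow_solve: Callable[[str, str], str],    # hypothetical System 2 solver
    threshold: float = 0.5,                   # assumed value, not taken from the paper
) -> List[StepResult]:
    context = problem
    results: List[StepResult] = []
    for sub in decompose(problem):                      # Stage 1: decomposition
        difficulty = score(sub, context)                # Stage 2: difficulty assessment
        mode = select_mode(difficulty, threshold)       # Stage 3: mode selection
        solver = slow_solve if mode == "system2" else fast_solve
        answer = solver(sub, context)                   # Stage 4: execution
        context += f"\n{sub}\n{answer}"                 # carry earlier steps forward
        results.append(StepResult(sub, difficulty, mode, answer))
    return results


if __name__ == "__main__":
    # Toy stubs in place of real model calls.
    steps = run_selective_pipeline(
        "toy problem",
        decompose=lambda p: ["expand the product", "prove the bound"],
        score=lambda sub, ctx: 0.9 if "prove" in sub else 0.1,
        fast_solve=lambda sub, ctx: f"[fast] {sub}",
        slow_solve=lambda sub, ctx: f"[deliberate] {sub}",
    )
    for step in steps:
        print(step.mode, step.answer)
```

Passing the solvers in as callables keeps the routing logic separate from any particular model, so the same loop could wrap one model prompted in two modes or two different models.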

Paper Structure

This paper contains 21 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: SCALE Framework Overview. SCALE operates through four stages: (1) Problem Decomposition: breaks the mathematical problem into sequential sub-problems; (2) Difficulty Assessment: computes a difficulty score for each sub-problem to distinguish routine operations from computationally challenging sub-problems; (3) Adaptive Mode Selection: assigns each sub-problem to either fast processing (System 1) or deliberate reasoning (System 2) based on a difficulty threshold; (4) Sequential Execution: processes sub-problems with full context propagation. This selective resource allocation concentrates computation on challenging sub-problems while efficiently handling routine ones.
  • Figure 2: Inference-time scaling of SCALE for Qwen3-32B-SCALE across three benchmarks.