ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models

Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Zhiyuan Yu, Qipeng Guo, Xuanjing Huang, Xipeng Qiu

TL;DR

ARISE addresses the challenge of evaluating test-time scaling for large reasoning models by introducing a per-sample, trajectory-based metric that penalizes degradation and wasted computation, paired with an adaptive sampling procedure to stabilize measurements under stochastic inference. The method computes adjacent-pair contributions with a ratio-based, magnitude-aware design and applies a variance-guided sampling protocol to allocate resources where uncertainty is highest. Empirical results across mathematical, scientific, coding, agentic, and multimodal tasks show ARISE provides more discriminative and stable assessments than traditional scaling metrics, with clear visibility into negative scaling and model evolution. The work demonstrates ARISE’s robustness, cross-dataset consistency, and practical utility for benchmarking and tracking progress in test-time scaling capabilities of modern reasoning models.

Abstract

Test-time scaling has emerged as a transformative paradigm for enhancing the performance of large reasoning models, enabling dynamic allocation of computational resources during inference. However, as the landscape of reasoning models rapidly expands, a critical question remains: how can we systematically compare and evaluate the test-time scaling capabilities across different models? In this paper, we introduce ARISE (Adaptive Resolution-aware Scaling Evaluation), a novel metric specifically designed to assess the test-time scaling effectiveness of large reasoning models. Unlike existing evaluation approaches, ARISE incorporates two key innovations: (1) sample-level awareness that effectively penalizes negative scaling behaviors where increased computation leads to performance degradation, and (2) a dynamic sampling mechanism that mitigates the impact of accuracy fluctuations and token count instability on the final assessment. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains including mathematical reasoning, code generation, and agentic tasks. Our results demonstrate that ARISE provides a reliable and fine-grained measurement of test-time scaling capabilities, revealing significant variations in scaling efficiency across models. Notably, our evaluation identifies Claude Opus as exhibiting superior scaling characteristics compared to other contemporary reasoning models.

Paper Structure

This paper contains 64 sections, 3 theorems, 24 equations, 8 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

The Scaling Metric is bounded: $\text{Scaling} \in \left[\frac{-1}{\delta_{\min}}, \frac{1}{\delta_{\min}}\right]$ where $\delta_{\min} = \min_{p_1,p_2} (\mathcal{T}(p_2) - \mathcal{T}(p_1))$.
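The bound follows from a short argument, sketched here under an assumption that this summary does not restate: that the Scaling Metric is the accuracy-per-token slope between two scaling points $p_1$ and $p_2$ with $\mathcal{T}(p_2) > \mathcal{T}(p_1)$. The symbol $\mathcal{A}(p) \in [0,1]$ for accuracy is introduced here for illustration only. Since two accuracies can differ by at most 1 in absolute value, and the token gap is at least $\delta_{\min}$:

$$
\left|\frac{\mathcal{A}(p_2)-\mathcal{A}(p_1)}{\mathcal{T}(p_2)-\mathcal{T}(p_1)}\right| \;\le\; \frac{1}{\mathcal{T}(p_2)-\mathcal{T}(p_1)} \;\le\; \frac{1}{\delta_{\min}}
$$

Under the same assumption, any average of such slopes lies in the same interval, so the range of the metric is set by the smallest token gap rather than by a fixed scale.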

Figures (8)

  • Figure 1: Limitations of slope-based metrics in test-time scaling evaluation. (a) When performance improves from $p_0$ to $p_1$ and $p_1'$, the steeper slope correctly rewards $p_1$ for achieving the same accuracy with fewer tokens. (b) When performance degrades, the slope metric incorrectly assigns a higher value to $p_1'$ despite it wasting more tokens for worse performance.
  • Figure 2: Comparison of ARISE and Scaling Metric across different Qwen3 models on code and agentic tasks. The x-axis shows model parameter counts where 0.6B, 1.7B, 4B, 8B, 14B, 32B correspond to Qwen3 models, 30B corresponds to Qwen3-30B-A3B, and 235B corresponds to Qwen3-235B-A22B. ARISE values are shown on the left y-axis (green bars) while Scaling Metric values are shown on the right y-axis (blue bars). Error bars represent standard deviations across five independent runs. Complete results for all models are presented in Appendix Table \ref{tab:arise_scaling_code_agentic}.
  • Figure 3: Comparison of ARISE and Scaling Metric across state-of-the-art reasoning models on multimodal reasoning tasks. The evaluation encompasses three challenging benchmarks: (a) MMMU, (b) MathVista, and (c) CharXiv-Reasoning. ARISE values are shown on the left y-axis (green bars) while Scaling Metric values are shown on the right y-axis (orange bars).
  • Figure 4: Stability analysis of adaptive sampling on (a) GPT-5 and (b) Claude Opus 4.1. Box plots compare variance across six sampling strategies, demonstrating that adaptive sampling achieves superior stability.
  • Figure 5: Hyperparameter analysis of adaptive sampling using GPT-5 on AIME. (a) Threshold $\tau$: box plots show ARISE (left axis) while the line plot indicates total API calls (right axis). (b) Maximum attempts $m_{max}$: box plots display ARISE (left axis) while the line plot tracks unconverged samples failing to meet threshold $\tau$ (right axis).
  • ...and 3 more figures
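
The two cases described in the Figure 1 caption can be checked numerically. The following is a minimal sketch, assuming the slope-based metric is simply the change in accuracy divided by the change in token count between two points. The `slope` helper and all point values are hypothetical, chosen for illustration, and are not taken from the paper.

```python
def slope(p_from, p_to):
    """Accuracy-per-token slope between two (tokens, accuracy) points.

    Assumed form of a slope-based scaling metric; not the paper's code.
    """
    (t0, a0), (t1, a1) = p_from, p_to
    if t1 <= t0:
        raise ValueError("token count must strictly increase")
    return (a1 - a0) / (t1 - t0)


# Hypothetical baseline point p0: 1000 tokens, 60% accuracy.
p0 = (1000, 0.60)

# Case (a): both successors improve to the same accuracy.
improve_cheap = (2000, 0.70)   # plays the role of p1
improve_costly = (4000, 0.70)  # plays the role of p1'

# Case (b): both successors degrade to the same accuracy.
degrade_cheap = (2000, 0.50)   # plays the role of p1
degrade_costly = (4000, 0.50)  # plays the role of p1'

slope_a_cheap = slope(p0, improve_cheap)
slope_a_costly = slope(p0, improve_costly)
slope_b_cheap = slope(p0, degrade_cheap)
slope_b_costly = slope(p0, degrade_costly)

# (a) The cheaper improvement gets the steeper (larger) slope, as desired.
print("improvement:", slope_a_cheap, slope_a_costly)
# (b) The costlier degradation gets the larger (less negative) slope.
print("degradation:", slope_b_cheap, slope_b_costly)
```

In case (b) the point that spends three times as many extra tokens for the same loss in accuracy receives the higher score, which is the behavior the caption identifies as incorrect.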

Theorems & Definitions (6)

  • Theorem 1: Scaling Metric Bounds
  • Proof 1
  • Theorem 2: Sample-Level ARISE Bounds
  • Proof 2
  • Theorem 3: ARISE Bounds
  • Proof 3