ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models
Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Zhiyuan Yu, Qipeng Guo, Xuanjing Huang, Xipeng Qiu
TL;DR
ARISE addresses the challenge of evaluating test-time scaling for large reasoning models by introducing a per-sample, trajectory-based metric that penalizes degradation and wasted computation, paired with an adaptive sampling procedure to stabilize measurements under stochastic inference. The method computes adjacent-pair contributions with a ratio-based, magnitude-aware design and applies a variance-guided sampling protocol to allocate resources where uncertainty is highest. Empirical results across mathematical, scientific, coding, agentic, and multimodal tasks show ARISE provides more discriminative and stable assessments than traditional scaling metrics, with clear visibility into negative scaling and model evolution. The work demonstrates ARISE’s robustness, cross-dataset consistency, and practical utility for benchmarking and tracking progress in test-time scaling capabilities of modern reasoning models.
Abstract
Test-time scaling has emerged as a transformative paradigm for enhancing the performance of large reasoning models, enabling dynamic allocation of computational resources during inference. However, as the landscape of reasoning models rapidly expands, a critical question remains: how can we systematically compare and evaluate the test-time scaling capabilities of different models? In this paper, we introduce ARISE (Adaptive Resolution-aware Scaling Evaluation), a novel metric specifically designed to assess the test-time scaling effectiveness of large reasoning models. Unlike existing evaluation approaches, ARISE incorporates two key innovations: (1) sample-level awareness that penalizes negative scaling behaviors, in which increased computation leads to performance degradation, and (2) a dynamic sampling mechanism that mitigates the impact of accuracy fluctuations and token count instability on the final assessment. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains, including mathematical reasoning, code generation, and agentic tasks. Our results demonstrate that ARISE provides a reliable and fine-grained measurement of test-time scaling capabilities, revealing significant variations in scaling efficiency across models. Notably, our evaluation identifies Claude Opus as exhibiting superior scaling characteristics compared to other contemporary reasoning models.
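
The dynamic sampling idea described above can be illustrated with a minimal sketch. This section does not specify the actual procedure, so everything below is an assumption made for illustration rather than the ARISE protocol itself: the function names (`standard_error`, `adaptive_sample`), the parameters (`min_runs`, `max_total_runs`, `target_se`), and the stopping rule are all hypothetical. The sketch repeatedly gives one extra evaluation run to whichever configuration currently has the noisiest mean estimate, which is one simple way to spend a fixed budget where uncertainty is highest.

```python
import statistics
from typing import Callable, Dict, List


def standard_error(values: List[float]) -> float:
    """Standard error of the mean; infinite when fewer than two samples exist."""
    if len(values) < 2:
        return float("inf")
    return statistics.stdev(values) / len(values) ** 0.5


def adaptive_sample(
    run_once: Callable[[str], float],
    configs: List[str],
    min_runs: int = 3,
    max_total_runs: int = 60,
    target_se: float = 0.01,
) -> Dict[str, List[float]]:
    """Hypothetical variance-guided sampler (not the paper's exact procedure).

    run_once(config) performs one stochastic evaluation and returns a score,
    for example accuracy at a given reasoning budget.
    """
    # Start with the same small number of runs for every configuration.
    results = {c: [run_once(c) for _ in range(min_runs)] for c in configs}
    total = min_runs * len(configs)

    while total < max_total_runs:
        # Pick the configuration whose mean is currently least certain.
        noisiest = max(configs, key=lambda c: standard_error(results[c]))
        if standard_error(results[noisiest]) <= target_se:
            break  # every estimate is already stable enough
        results[noisiest].append(run_once(noisiest))
        total += 1

    return results
```

In use, `run_once` would wrap a single evaluation of a model at one scaling level. Configurations whose scores barely vary keep only their initial runs, while unstable ones absorb the remaining budget.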
