LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu

TL;DR

The paper tackles verbose reasoning in large language models distilled from large reasoning models by proposing PIR, a perplexity-based importance refinement that prunes low-importance functional steps while preserving progressive reasoning. It classifies reasoning steps into four cognitive patterns, uses a PIR score to quantify each functional step's contribution, and selectively prunes steps to reduce verbosity without sacrificing solution integrity. Empirical results on AIME, AMC, and GPQA Diamond across multiple data sources and model sizes show improved accuracy and significant token reductions, with robust test-time scaling. The work offers a practical, generalizable approach for deploying reasoning-enabled LLMs under latency and compute constraints.

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
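The scoring-and-pruning idea in the abstract can be illustrated with a small sketch. The exact PIR formula is not given in this excerpt, so the code assumes one plausible form: the ratio of the answer's perplexity with a step removed to its perplexity under the full chain, so that a ratio near 1 marks a step the answer barely depends on. The names `Step`, `pir_scores`, `prune`, the step labels, and `toy_logprob_fn` (a stand-in for a real language model) are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Step:
    text: str
    kind: str  # "progressive", or a functional label such as "verification"


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean token log-probability (input must be non-empty)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pir_scores(
    steps: List[Step],
    answer: str,
    logprob_fn: Callable[[List[str], str], List[float]],
) -> Dict[int, float]:
    """Assumed score: PPL(answer | chain without step) / PPL(answer | full chain).

    Only functional steps are scored; progressive steps are never candidates.
    """
    full = perplexity(logprob_fn([s.text for s in steps], answer))
    scores: Dict[int, float] = {}
    for i, step in enumerate(steps):
        if step.kind == "progressive":
            continue
        ablated = [t.text for j, t in enumerate(steps) if j != i]
        scores[i] = perplexity(logprob_fn(ablated, answer)) / full
    return scores


def prune(steps: List[Step], scores: Dict[int, float], ratio: float) -> List[Step]:
    """Drop the lowest-scoring fraction `ratio` of the scored functional steps."""
    n_drop = int(len(scores) * ratio)
    drop = set(sorted(scores, key=scores.get)[:n_drop])
    return [s for i, s in enumerate(steps) if i not in drop]


def toy_logprob_fn(context: List[str], answer: str) -> List[float]:
    """Toy stand-in for a model: the answer becomes more likely the more
    context steps mention it. A real setup would query an LLM instead."""
    support = sum(answer in c for c in context)
    return [-2.0 / (1 + support)] * len(answer.split())


chain = [
    Step("x + 2 = 5, so x = 3", "progressive"),
    Step("Check: 3 + 2 = 5, correct", "verification"),
    Step("Alternatively, guess and test values", "alternative"),
]
scores = pir_scores(chain, "3", toy_logprob_fn)
kept = prune(chain, scores, ratio=0.5)
```

In this toy run the verification step raises the answer's perplexity when removed, so it is retained, while the alternative-approach step leaves the perplexity unchanged and is pruned. The pruning ratio is a free parameter here, in the spirit of the pruning-ratio study listed under Figure 3.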

Paper Structure

This paper contains 38 sections, 2 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 2: PIR framework pipeline for reasoning optimization: raw reasoning is segmented into logical steps, each step is classified into a reasoning pattern, a PIR value is calculated to quantify each step's importance, and low-value functional steps are filtered out while progressive reasoning is preserved, resulting in more efficient reasoning chains.
  • Figure 3: Impact of pruning ratio on model performance. This figure displays relative performance metrics (normalized to baseline) across different pruning ratios for AIME and AMC benchmarks. The horizontal dashed line represents the baseline performance (ratio=0).
  • Figure 4: Impact of PIR refinement across model sizes and benchmarks. Heatmaps show relative percentage changes between models trained with pruned versus original datasets. Blue indicates improvement: higher accuracy, shorter response length, or better efficiency.
  • Figure 5: A case where one training sample contains the four patterns.
  • Figure 6: The prompt to segment coherent sub-thinking sentences into cohesive reasoning steps.
  • ...and 3 more figures