Efficient Thought Space Exploration through Strategic Intervention
Ziheng Li, Hengyi Cai, Xiaochi Wei, Yuchen Li, Shuaiqiang Wang, Zhi-Hong Deng, Dawei Yin
TL;DR
This work tackles the prohibitive cost of inference-time reasoning in large language models by showing that most next-token predictions align with a high-capacity oracle except at a sparse set of critical tokens. It introduces Hint-Practice Reasoning (HPR), a two-role framework where a hinter provides targeted hints and a practitioner executes the bulk of reasoning, guided by a Distributional Inconsistency Reduction (DIR) metric that identifies where interventions yield maximal benefit in a tree-structured thought space. By iteratively selecting high-DIR nodes and updating the reasoning tree, HPR reweights promising branches and suppresses low-probability paths, achieving comparable accuracy to strong baselines but at only about 1/5 the token cost and with similar or lower FLOPs. The approach demonstrates consistent improvements across arithmetic and commonsense benchmarks, supported by a theoretical foundation that connects path exploration quality with distribution alignment between the practitioner and hinter. Overall, HPR offers a practical, scalable method for efficient thought-space exploration with strong performance gains and broad applicability to complex reasoning tasks.
Abstract
While large language models (LLMs) demonstrate emerging reasoning capabilities, current inference-time expansion methods incur prohibitive computational costs through exhaustive sampling. By analyzing decoding trajectories, we observe that most next-token predictions align well with the golden output, except for a few critical tokens that lead to deviations. Inspired by this phenomenon, we propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components: 1) a hinter (a powerful LLM) that provides probabilistic guidance at critical decision points, and 2) a practitioner (an efficient smaller model) that executes the major reasoning steps. The framework's core innovation is Distributional Inconsistency Reduction (DIR), a theoretically grounded metric that dynamically identifies intervention points by quantifying the divergence between the practitioner's reasoning trajectory and the hinter's expected distribution in a tree-structured probabilistic space. Through iterative tree updates guided by DIR, HPR reweights promising reasoning paths while deprioritizing low-probability branches. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs: it achieves performance comparable to self-consistency and MCTS baselines while decoding only 1/5 of the tokens, and outperforms existing methods by up to 5.1% absolute accuracy while maintaining similar or lower FLOPs.
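
The observation that the practitioner and hinter agree on most tokens and diverge at only a few can be pictured with a toy sketch. This is not the paper's DIR metric, whose definition is not given in this section; as a stand-in assumption, it scores each decoding step by the KL divergence between the two models' next-token distributions and picks the step with the largest gap as the candidate intervention point. The function names and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def pick_intervention_point(hinter_dists, practitioner_dists):
    """Return (index, scores): the step where the two models disagree most.

    Assumption: per-step KL divergence is used as a simple proxy for
    "distributional inconsistency"; the paper's DIR metric may differ.
    """
    scores = [
        kl_divergence(h, p) for h, p in zip(hinter_dists, practitioner_dists)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy next-token distributions over a 3-token vocabulary at 4 decoding steps.
# The models agree closely everywhere except step 2 (the "critical token").
hinter = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
practitioner = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

index, scores = pick_intervention_point(hinter, practitioner)
print(index)  # step with the largest hinter/practitioner divergence
```

In this toy setting only one of the four steps shows a large divergence, which mirrors the motivation for querying the hinter sparsely rather than at every token.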
