Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

Sophie Li, Nicholas Huang, Nayan Saxena, Nina Luo, Vincent Lin, Kevin Zhu, Sunishchal Dev

TL;DR

KAPPA (KL-Adjusted Pruned Path Algorithm) is an inference-time pruning method that scores reasoning branches with three signals, KL divergence (information gain), confidence, and entropy, and uses that score to progressively prune unpromising branches during multi-path decoding. The method alternates between a draft phase that encourages diverse exploration and a gating phase that prunes branches by a trajectory-weighted score; a continuation phase then completes the best surviving path. On GSM8K and MATH500 with two open-source models (DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct), KAPPA reduces peak memory by up to ~60% and total generated tokens by up to ~90% relative to Best-of-N (BoN), while maintaining or modestly improving accuracy, particularly for the smaller model. The results indicate that KAPPA stabilizes performance in smaller models, whereas larger models may require adaptive pruning schedules to avoid over-pruning. Because it is training-free and lowers memory and token costs, the approach is relevant to real-time reasoning tasks and to deployment on resource-constrained devices.

Abstract

Large language models (LLMs) improve reasoning accuracy when generating multiple candidate solutions at test time, but standard methods like Best-of-N (BoN) incur high computational cost by fully generating all branches. Self-Truncation Best-of-N (ST-BoN) mitigates this by truncating unpromising paths early, but its reliance on consistency-based heuristics is a limitation as it does not directly evaluate branch quality. We present KL-Adjusted Pruned Path Algorithm (KAPPA), an inference-time method that combines Kullback-Leibler divergence, confidence, and entropy into a principled scoring function to guide progressive pruning. By promoting diversity during exploration and selectively eliminating low-scoring branches, KAPPA maintains accuracy while substantially reducing memory and token usage. Experiments on GSM8K and MATH500 with DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-7B-Instruct demonstrate that KAPPA stabilizes performance in smaller models and achieves up to ~60% reduction in peak memory and ~90% reduction in total token generation relative to BoN, with minimal impact on accuracy.
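The abstract does not give the exact form of the scoring function, so the following is only a minimal sketch of how the three signals could be combined to rank and prune branches. The uniform reference distribution for the KL term, the linear weighting with `w_kl`, `w_conf`, and `w_ent`, the plain mean over steps (standing in for the trajectory-weighted score), and the function names `token_signals`, `branch_score`, and `prune` are all assumptions made for illustration, not the paper's definitions.

```python
import math


def token_signals(probs, reference):
    """Return (kl, confidence, entropy) for one next-token distribution.

    `reference` is an assumed baseline distribution (uniform in the example
    below); it must be positive wherever `probs` is positive.
    """
    kl = sum(p * math.log(p / q) for p, q in zip(probs, reference) if p > 0)
    confidence = max(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return kl, confidence, entropy


def branch_score(step_probs, reference, w_kl=1.0, w_conf=1.0, w_ent=1.0):
    """Mean per-step score for one branch.

    Assumed form: reward KL divergence and confidence, penalize entropy.
    The weights are hypothetical hyperparameters.
    """
    total = 0.0
    for probs in step_probs:
        kl, conf, ent = token_signals(probs, reference)
        total += w_kl * kl + w_conf * conf - w_ent * ent
    return total / len(step_probs)


def prune(branches, reference, keep):
    """Return the names of the `keep` highest-scoring branches."""
    ranked = sorted(
        branches,
        key=lambda name: branch_score(branches[name], reference),
        reverse=True,
    )
    return ranked[:keep]


# Toy example: two one-step branches over a three-token vocabulary.
uniform = [1 / 3, 1 / 3, 1 / 3]
branches = {
    "sharp": [[0.90, 0.05, 0.05]],  # peaked distribution
    "flat": [[0.34, 0.33, 0.33]],   # near-uniform distribution
}
survivors = prune(branches, uniform, keep=1)
```

In this toy setup the peaked branch outranks the near-uniform one on all three signals, so it is the one that survives pruning. A real implementation would read the per-step distributions from the model's logits and would apply the pruning repeatedly during decoding rather than once.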

Paper Structure

This paper contains 15 sections, 3 figures, 2 algorithms.

Figures (3)

  • Figure 1: Computational cost and accuracy for two LLMs on two mathematical reasoning datasets, as labeled. The points on each polyline correspond to sampling sizes $N = 5, 10, 20$, from left to right.
  • Figure 2: The computed peak memory reduction ratio under different sampling sizes $N$.
  • Figure 3: The computed token reduction ratio under different sampling sizes $N$.