SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang

TL;DR

SEAL identifies execution, reflection, and transition thoughts in LLM reasoning and demonstrates that excessive reflection/transition steps harm efficiency and accuracy. It introduces a training-free latent-space steering method: offline extraction of a steering vector and on-the-fly latent-space intervention during decoding to boost execution-focused reasoning. Across multiple models and benchmarks (Math500, GSM8K, LiveCodeBench, NaturalPlan), SEAL delivers consistent accuracy gains (up to ~14%) and substantial token reductions (up to ~50%), with strong transferability across tasks. The approach requires no fine-tuning or architectural changes, offering a practical, interpretable means to optimize reasoning in large language models.

Abstract

Large Language Models (LLMs) such as OpenAI's o1-series have demonstrated compelling capabilities on complex reasoning tasks via extended chain-of-thought (CoT) reasoning. However, recent studies reveal substantial redundancy in CoT reasoning traces, which not only increases inference latency but also harms model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases, and that these thought categories exhibit clear separation in the latent space. Based on these findings, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while delivering significant efficiency gains. SEAL consists of an offline stage that extracts the reasoning steering vector in the latent space, followed by on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
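
The two stages named in the abstract can be illustrated with a toy sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the steering vector is taken to be the mean hidden state of execution thoughts minus the mean hidden state of reflection/transition thoughts, and the intervention adds a scaled copy of that vector to a hidden state during decoding. The function names and the scaling parameter `alpha` are hypothetical and are not taken from the SEAL code base.

```python
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def extract_steering_vector(execution_states, other_states):
    """Offline stage (assumed form): mean execution-thought state minus
    mean reflection/transition-thought state."""
    mu_exec = mean_vector(execution_states)
    mu_other = mean_vector(other_states)
    return [e - o for e, o in zip(mu_exec, mu_other)]


def steer(hidden_state, steering_vector, alpha=1.0):
    """Decoding-time stage (assumed form): shift one hidden state along the
    steering vector, scaled by the hypothetical strength alpha."""
    return [h + alpha * s for h, s in zip(hidden_state, steering_vector)]


# Toy 2-dimensional "hidden states"; real states would come from a model layer.
execution_states = [[1.0, 0.0], [3.0, 2.0]]
other_states = [[0.0, 1.0], [0.0, 3.0]]

v = extract_steering_vector(execution_states, other_states)
steered = steer([0.5, 0.5], v, alpha=0.5)
```

In a real setting the hidden states would be collected from a chosen layer at thought boundaries, and the shift would be applied inside the model's forward pass; which layer, which token positions, and what strength to use are design choices this sketch leaves open.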

Paper Structure

This paper contains 29 sections, 2 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: One example from DeepSeek-R1-Distill-Qwen-7B for a math question, where the entire response is divided into individual thought blocks.
  • Figure 2: Statistics on the number of different types of thoughts for subsets of samples that the model answered correctly and incorrectly. The results are derived from DeepSeek-R1-Distill-Qwen-1.5B on the Math-500 task. Response lengths are reported numerically.
  • Figure 3: Results of t-SNE visualization of different reasoning thoughts in the latent space.
  • Figure 4: Overview of our SEAL framework. The upper subfigure illustrates the offline extraction process of the reasoning steering vector, while the lower subfigure depicts the inference process utilizing the extracted steering vector.
  • Figure 5: Applying a logits penalty causes other reflection and transition thoughts to increase.
  • ...and 4 more figures