LightThinker++: From Reasoning Compression to Memory Management

Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang

Abstract

Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the growing cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, such static compression often struggles on complex reasoning, where the irreversible loss of intermediate details can create logical bottlenecks. To address this, we evolve the framework into LightThinker++, which introduces Explicit Adaptive Memory Management. This paradigm shifts management to the behavioral level through explicit memory primitives, supported by a specialized trajectory-synthesis pipeline that trains purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility along three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ cuts peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget when configured for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable memory footprint beyond 80 rounds (a 60–70% reduction), achieving an average performance gain of 14.8% across diverse complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.
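
To make the core mechanism concrete, below is a minimal sketch of the compressed-reasoning loop in Python. It is illustrative only: `generate_thought`, `compress_to_gist`, and the `<answer>` terminator are hypothetical stand-ins for the trained model and compression components described in the paper.

```python
# Minimal sketch of LightThinker-style compressed reasoning (illustrative;
# generate_thought / compress_to_gist are hypothetical model components).

def reason_with_compression(question: str, generate_thought, compress_to_gist,
                            max_steps: int = 16) -> str:
    context = [question]                       # running context fed to the model
    for _ in range(max_steps):
        thought = generate_thought(context)    # full intermediate thought T_i
        if "<answer>" in thought:              # hypothetical stop marker
            return thought                     # answer produced, done
        gist = compress_to_gist(thought)       # compact representation CT_i
        context.append(gist)                   # keep only the gist, drop T_i
    return generate_thought(context)           # budget exhausted: answer from gists
```

The point visible in the loop body is that the context grows by one short gist per step instead of accumulating every full thought, which is what bounds peak token usage.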

Paper Structure

This paper contains 90 sections, 15 equations, 29 figures, and 11 tables.

Figures (29)

  • Figure 1: An illustration of the compressed reasoning paradigms. (a) A CoT example. Tokens highlighted in yellow represent critical reasoning tokens, while the remaining tokens primarily ensure fluency. (b) The Vanilla approach generates full thought tokens. (c) LightThinker compresses each thought into a concise representation ($CT_i$). (d) LightThinker++ further incorporates explicit memory management to handle summaries of thoughts, enhancing reasoning efficiency and coherence.
  • Figure 2: An overview of LightThinker, illustrated with a three-step reasoning example. Fig. (a) shows the attention mask of Vanilla during both training and inference. Fig. (b) depicts the attention mask of LightThinker during training. Fig. (c) presents the complete inference process of LightThinker along with the attention mask corresponding to each step. Here, 'w' denotes the size of the matrix.
  • Figure 3: Relationship between context length and the number of generated tokens across different methods. The Dependency metric corresponds to the area under the curve, while Peak Token indicates the curve's maximum value. See Appx. \ref{sec:app:metric} for details.
  • Figure 4: Overview of LightThinker++. a) Memory Action Space: Reasoning steps are instantiated as dual-form entities $\mathcal{I}_i = (R_i, Z_i)$. The visibility state of each step is explicitly managed via commit, expand, and fold primitives (see the sketch after this list). b) Inference Overview: An illustration of the step-wise inference process. $\tilde{\mathcal{H}}_t$ denotes the stateful managed context where historical steps are dynamically projected as either summaries ($Z_i$, marked with $<$) or raw derivations ($R_i$, marked with $\vee$) based on the model's self-directed memory policy.
  • Figure 5: Efficiency Analysis and Ablation Results. (a) shows the average number of generated tokens for each model on each dataset. (b) shows the distribution of token lengths across ranges, while the cumulative curve indicates the overall proportion up to each range. (c) illustrates the relationship between output length and inference time, with each subplot reporting inference time and peak token count. (d) reports the average compression ratios, with error bars showing 95% confidence intervals. (e--f) examine how cache size $|C|$ affects accuracy, Dep, inference time, peak tokens, generated tokens, and compression frequency.
  • (Captions for the remaining 24 figures omitted here.)
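
As referenced in the Figure 4 caption above, the following is a minimal sketch of the explicit memory primitives (commit, expand, and fold) and of how the managed context $\tilde{\mathcal{H}}_t$ could be projected from dual-form steps $\mathcal{I}_i = (R_i, Z_i)$. Class and method names are assumptions for illustration, not the paper's actual API.

```python
# Illustrative sketch of the memory action space in Figure 4; names and data
# layout are assumptions, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class Step:
    raw: str                 # R_i: full derivation of the step
    summary: str             # Z_i: compact summary of the step
    expanded: bool = False   # visibility state (folded by default)

class ManagedContext:
    """Holds every step once; projects each as R_i or Z_i on demand."""

    def __init__(self) -> None:
        self.steps: list[Step] = []

    def commit(self, raw: str, summary: str) -> int:
        """Finish a reasoning step: store (R_i, Z_i), exposing the summary."""
        self.steps.append(Step(raw, summary))
        return len(self.steps) - 1

    def expand(self, i: int) -> None:
        """Recall step i in full when its details are needed again."""
        self.steps[i].expanded = True

    def fold(self, i: int) -> None:
        """Collapse step i back to its summary to free context budget."""
        self.steps[i].expanded = False

    def render(self) -> str:
        """Project the managed history fed to the model at step t."""
        return "\n".join(s.raw if s.expanded else s.summary for s in self.steps)
```

In this reading, the model's self-directed memory policy emits these primitives during decoding, and `render` corresponds to projecting $\tilde{\mathcal{H}}_t$: the peak footprint stays near the sum of the summaries plus whichever few steps are currently expanded.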