Free(): Learning to Forget in Malloc-Only Reasoning Models

Yilun Zheng; Dongyang Ma; Tian Liang; Jiahao Xu; Xinting Huang; Lihui Chen; Haitao Mi; Yan Wang

TL;DR

This work identifies a fundamental limitation of standard reasoning models: they accumulate reasoning steps without discarding obsolete information, leading to degradation on long-horizon tasks. It introduces Free()LM, which adds a plug-and-play Free-Module LoRA that alternates between Reasoning (unmerged) and Cleaning (merged) modes to prune redundant context and maintain a compact, noise-free state. Through a reward-based data synthesis pipeline, Free()LM learns effective pruning operations and demonstrates consistent gains across model scales (8B–685B), including a new SOTA on IMOAnswerBench and recovery from collapse in long-horizon tasks. The results suggest that sustainable, scalable intelligence requires the ability to forget as much as the power to think, enabling practical, self-managed reasoning agents.

Abstract

Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as "malloc-only" engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state. Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOAnswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to 50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.
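The alternation between reasoning and cleaning modes described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `generate_chunk` and `select_redundant` are hypothetical callables standing in for the main model (Free-Module unmerged) and the merged Free-Module, and `chunk_limit`, `max_steps`, and `stop_marker` are assumed parameters. The trigger rule used here (clean whenever the context holds at least `chunk_limit` chunks) is likewise an assumption.

```python
from typing import Callable, List

Chunk = str  # one fixed-size span of reasoning text (hypothetical representation)


def reason_with_free(
    generate_chunk: Callable[[List[Chunk]], Chunk],
    select_redundant: Callable[[List[Chunk]], List[int]],
    chunk_limit: int,
    max_steps: int,
    stop_marker: str = "<done>",
) -> List[Chunk]:
    """Alternate between reasoning and cleaning until the model stops.

    generate_chunk   -- hypothetical stand-in for the main model with the
                        adapter unmerged; returns the next reasoning chunk.
    select_redundant -- hypothetical stand-in for the merged Free-Module;
                        returns indices of chunks judged safe to drop.
    """
    context: List[Chunk] = []
    for _ in range(max_steps):
        # Reasoning mode: extend the context by one chunk.
        chunk = generate_chunk(context)
        context.append(chunk)
        if stop_marker in chunk:
            break
        # Cleaning mode (assumed trigger): prune once the limit is reached.
        if len(context) >= chunk_limit:
            drop = set(select_redundant(context))
            context = [c for i, c in enumerate(context) if i not in drop]
        # Reasoning then resumes on the cleaned context in the next iteration.
    return context
```

Passing the two model calls in as functions keeps the control flow separate from any particular serving stack, so the loop can be exercised with toy callables before wiring in a real model.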

Paper Structure (32 sections, 3 equations, 7 figures, 4 tables)

Introduction
Method
Architecture & Inference
Pruning & Resumed Inference:
Triggering:
Training: Learning to Forget
Data Synthesis
Reward Mechanism
Experiments
Settings
Backbone Models.
Compared Methods.
Benchmarks.
Implementation Details.
Main Results
...and 17 more sections

Figures (7)

Figure 1: Empirical observation of Qwen3-8B reasoning on AIME benchmarks. Left: Standard LLMs passively accumulate tokens, causing the reasoning process to eventually "crash" (degenerate). Right: Free()LM integrates an intrinsic free() mechanism. By periodically identifying and pruning redundant reasoning steps, it actively maintains a compact, noise-free state, enabling sustainable long-chain reasoning.
Figure 2: The Free()LM Inference Framework. The model operates on a cyclic Reasoning-Cleaning mechanism. Reasoning: The Main model generates tokens normally with the Free-Module unmerged. Cleaning: Upon reaching a chunk limit, the Free-Module is merged to identify and prune redundant chunks. Resumed Reasoning: The module is unmerged, and reasoning resumes on the cleaned context.
Figure 3: The Data Construction Pipeline. (a) Data Synthesis: We segment raw trajectories into 1k-token chunks and employ Gemini-2.5-Pro to sequentially generate candidate training instances. (b) Reward Mechanism: By executing $K=8$ parallel rollouts, we retain an instance only if the pruned context $C_{\text{new}}$ maintains or improves accuracy compared to the original ($\text{Acc}(C_{\text{new}}) \ge \text{Acc}(C_{\text{raw}})$).
Figure 4: Performance vs. Reasoning Length on HLE. While standard Qwen3-235B-A22B suffers from degradation on trajectories longer than 80k tokens, Free()LM exhibits a striking rebound in accuracy. On these cases, the Free-Module reduces the context length by $\sim$45%, effectively mitigating context pollution.
Figure 5: Case study comparing Free()LM versus Gemini deletion. Deleted spans are shown in red, newly generated content in green, and re-generated content matching previous deletions in blue. Free()LM (left) successfully prunes redundant reasoning; in contrast, Gemini (right) erroneously deletes critical logical anchors, forcing the subsequent reasoning model to re-generate the previously pruned context to restore the integrity of the reasoning chain.
...and 2 more figures
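The retention rule stated in the Figure 3 caption, keep an instance only if $\text{Acc}(C_{\text{new}}) \ge \text{Acc}(C_{\text{raw}})$ over $K=8$ rollouts, can be sketched as below. This is an illustration under stated assumptions rather than the paper's pipeline: `run_rollout` is a hypothetical callable that continues reasoning from a given context and reports whether the final answer is correct, and accuracy is assumed to be the fraction of correct rollouts.

```python
from typing import Callable, Sequence


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of rollouts that reached a correct answer."""
    return sum(outcomes) / len(outcomes)


def keep_instance(
    run_rollout: Callable[[str], bool],
    c_raw: str,
    c_new: str,
    k: int = 8,
) -> bool:
    """Retain a pruning instance only if pruning does not hurt accuracy.

    run_rollout -- hypothetical: resumes reasoning from a context and
                   returns True when the final answer is correct.
    c_raw       -- the original (unpruned) context.
    c_new       -- the pruned context proposed by the annotator model.
    k           -- number of rollouts per context (K=8 in the caption).
    """
    acc_raw = accuracy([run_rollout(c_raw) for _ in range(k)])
    acc_new = accuracy([run_rollout(c_new) for _ in range(k)])
    return acc_new >= acc_raw
```

With a stochastic model the two accuracy estimates are noisy at $K=8$, so the comparison is best read as a coarse filter that rejects prunings which clearly damage the reasoning chain.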
