Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang

TL;DR

DRIFT tackles the challenge of incorporating extensive knowledge into LLMs by decoupling knowledge extraction from reasoning. A lightweight knowledge model compresses document chunks into query-aware implicit fact tokens, which are projected into a larger reasoning model's embedding space to enable efficient long-context inference. The approach introduces bucketed compression and a staged training regime (LFRP, QAFT dynamic compression, QA) to align latent facts with downstream reasoning, achieving strong gains over compression baselines and demonstrating robustness across model sizes. Empirical results on diverse long-context benchmarks show that DRIFT improves accuracy and reduces latency, illustrating the practical impact of decoupled, latent-context reasoning. The work also provides data and code, and points to future RL-based enhancements and interpretability improvements for latent tokens.

Abstract

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.

Paper Structure (49 sections, 12 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: The overall workflow of DRIFT. DRIFT implements knowledge compression and decoupled reasoning in four steps. Step 1: The long document $X$ is recursively partitioned into semantically coherent chunks to preserve structural integrity. Step 2: The small knowledge model $\psi_{kno}$ compresses query-relevant information from each chunk in parallel into latent implicit fact tokens $T_J$. Step 3: The latent tokens are concatenated and mapped by an MLP projector $\pi$ to align with the reasoning model's embedding space. Step 4: The large reasoning model $\theta_{rea}$ generates the final response by performing efficient inference on the concatenated embeddings.
  • Figure 2: Three different training tasks for DRIFT. The instructions in the figure include the dynamic compression instruction, reconstruct instruction, answer instruction, and static compression instruction.
  • Figure 3: End-to-end TTFT as a function of input length for different baselines.
  • Figure 4: An Automated Pipeline for Contextual Question-Answering Data Synthesis
  • Figure 5: Training loss trajectory of Stage 1
  • ...and 6 more figures
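
The four steps in the Figure 1 caption can be illustrated with a toy data-flow sketch. This is not the DRIFT implementation: every function name, dimension, and token budget below is a hypothetical stand-in, the "knowledge model" is replaced by a seeded random generator, chunking uses fixed word windows rather than recursive semantic partitioning, and Step 4 (the reasoning model) is left out. The sketch only shows how per-chunk, query-conditioned latent tokens are concatenated and projected into a space of a different width.

```python
import math
import random

# Hypothetical sizes chosen for illustration only (not taken from the paper).
D_KNO = 8              # width of the knowledge model's latent tokens
D_REA = 16             # width of the reasoning model's embedding space
TOKENS_PER_CHUNK = 4   # number of implicit fact tokens emitted per chunk


def split_into_chunks(document: str, max_words: int = 50) -> list[str]:
    """Step 1 (simplified): fixed-size word windows, not semantic splitting."""
    words = document.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def knowledge_model(chunk: str, query: str) -> list[list[float]]:
    """Step 2 stand-in: pseudo-random latent tokens seeded by (query, chunk).

    A real knowledge model would be a small LLM; here the seed merely makes
    the output depend on both the chunk and the query.
    """
    rng = random.Random(f"{query}||{chunk}")
    return [[rng.gauss(0.0, 1.0) for _ in range(D_KNO)]
            for _ in range(TOKENS_PER_CHUNK)]


def make_projector(seed: int = 0, hidden: int = 32):
    """Step 3 stand-in: a randomly initialised one-hidden-layer MLP."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(D_KNO)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(D_REA)]

    def project(vec: list[float]) -> list[float]:
        h = [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in w1]
        return [sum(w * x for w, x in zip(row, h)) for row in w2]

    return project


def build_reasoner_inputs(document: str, query: str, project,
                          max_words: int = 50) -> list[list[float]]:
    """Steps 1-3: chunk, compress each chunk, concatenate, then project."""
    chunks = split_into_chunks(document, max_words)
    latent = [tok for chunk in chunks for tok in knowledge_model(chunk, query)]
    return [project(tok) for tok in latent]


doc = " ".join(f"word{i}" for i in range(230))
embeddings = build_reasoner_inputs(doc, "What is word7?", make_projector())
print(len(doc.split()), "input words ->", len(embeddings), "projected tokens")
```

In this toy setting the sequence handed to the reasoning model has a length fixed by the number of chunks times the per-chunk token budget, rather than by the raw document length, which is the intuition behind replacing raw text with compressed tokens.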