Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Wenxuan Xie; Yujia Wang; Xin Tan; Chaochao Lu; Xia Hu; Xuhong Wang

TL;DR

DRIFT addresses the challenge of incorporating extensive knowledge into LLMs by decoupling knowledge extraction from reasoning. A lightweight knowledge model compresses document chunks into query-aware implicit fact tokens, which are projected into a larger reasoning model's embedding space to enable efficient long-context inference. The approach introduces bucketed compression and a staged training regime: Latent Fact Reconstruction Pretraining (LFRP), followed by Query-Aware Fine-Tuning (QAFT) on a dynamic compression task and a question-answering task, which together align the latent facts with downstream reasoning. DRIFT reports strong gains over compression baselines and robustness across model sizes. Results on diverse long-context benchmarks show improved accuracy and lower latency. The work also provides data and code, and points to future RL-based enhancements and interpretability improvements for latent tokens.

Abstract

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.
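
The data flow described in the abstract can be illustrated with a minimal, shape-level sketch. It is not the authors' implementation: the names `chunk`, `knowledge_model`, `make_mlp`, and `build_reasoning_inputs`, the dimensions, and the fixed number of fact tokens per chunk are all hypothetical, and the "knowledge model" is a seeded random stand-in rather than a trained network. The sketch only shows how per-chunk, query-conditioned latent vectors might be concatenated and projected into a reasoning model's embedding space; the reasoning model itself is omitted.

```python
import random

# Hypothetical sizes: knowledge-model width, reasoning-model width,
# and number of implicit fact tokens emitted per chunk.
D_KNO, D_REA, K = 8, 16, 4

def chunk(text, max_words=50):
    """Split text into word-bounded chunks (a stand-in for semantic chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def knowledge_model(chunk_text, query, k=K, d=D_KNO):
    """Toy stand-in for the small knowledge model (assumption, not a real network).
    Returns k vectors of size d, seeded by (query, chunk) so that the output
    depends on the query as well as on the chunk."""
    rng = random.Random(query + "\x00" + chunk_text)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(k)]

def make_mlp(d_in, d_hidden, d_out, seed=0):
    """Build a two-layer ReLU MLP projector with fixed random weights."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_out)]

    def project(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
        return [sum(w * x for w, x in zip(row, hidden)) for row in w2]

    return project

def build_reasoning_inputs(document, query, max_words=50):
    """Compress each chunk, concatenate the latent tokens, then project them."""
    project = make_mlp(D_KNO, 32, D_REA)
    latent_tokens = []
    for piece in chunk(document, max_words):
        latent_tokens.extend(knowledge_model(piece, query))
    return [project(v) for v in latent_tokens]

doc = "word " * 120  # 120 words -> 3 chunks at 50 words per chunk
embeds = build_reasoning_inputs(doc, "What is DRIFT?")
print(len(embeds), len(embeds[0]))  # 12 16
```

In this toy setting a 120-word document becomes 12 vectors of width 16, which would stand in for the raw text at the reasoning model's input. How many tokens the real system emits per chunk is governed by the paper's bucketed compression scheme, which this sketch does not model.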

Paper Structure (49 sections, 12 equations, 11 figures, 6 tables)

Introduction
Related Work
Prompt Compression
Hard Compression (Token Selection).
Soft Compression (Latent Representation).
Learned Memory
Methodology
Bucketed Compression: Beyond Fixed-Ratio Compression
DRIFT: Decoupled Reasoning with Implicit Fact Tokens
Task Definition
Latent Fact Reconstruction Pretraining (LFRP)
Query-Aware Fine-Tuning (QAFT) with Single-Context
Dynamic Compression Task
Question-Answering Task
Multi-Context Inference without Fine-Tuning
...and 34 more sections

Figures (11)

Figure 1: The overall workflow of DRIFT. DRIFT implements knowledge compression and decoupled reasoning in four steps. Step 1: The long document $X$ is recursively partitioned into semantically coherent chunks to preserve structural integrity. Step 2: The small knowledge model $\psi_{kno}$ compresses query-relevant information from each chunk in parallel into latent implicit fact tokens $T_J$. Step 3: The latent tokens are concatenated and mapped by an MLP projector $\pi$ to align with the reasoning model's embedding space. Step 4: The large reasoning model $\theta_{rea}$ generates the final response by performing efficient inference on the concatenated embeddings.
Figure 2: Three different training tasks for DRIFT. The instructions in the figure include the dynamic compression instruction, reconstruction instruction, answer instruction, and static compression instruction.
Figure 3: End-to-end TTFT as a function of input length for different baselines.
Figure 4: An Automated Pipeline for Contextual Question-Answering Data Synthesis.
Figure 5: Training loss trajectory of Stage 1.
...and 6 more figures
