Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang
TL;DR
DRIFT tackles the challenge of incorporating extensive knowledge into LLMs by decoupling knowledge extraction from reasoning. A lightweight knowledge model compresses document chunks into query-aware implicit fact tokens, which are projected into a larger reasoning model's embedding space to enable efficient long-context inference. The approach introduces bucketed compression and a staged training regime (LFRP, QAFT dynamic compression, QA) to align latent facts with downstream reasoning, achieving strong gains over compression baselines and demonstrating robustness across model sizes. Empirical results on diverse long-context benchmarks show that DRIFT improves accuracy and reduces latency, illustrating the practical impact of decoupled, latent-context reasoning. The work also provides data and code, and points to future RL-based enhancements and interpretability improvements for latent tokens.
Abstract
The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.
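The data flow described above (chunk the document, compress each chunk conditioned on the query, project the result into the reasoning model's embedding space) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dimensions, the softmax-weighted pooling used as a stand-in for the knowledge model, and the random projection matrix are all assumptions chosen only to show the shapes involved.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_chunk(chunk, query, k):
    """Toy stand-in for the knowledge model (hypothetical).

    Splits a chunk of token vectors into at most k groups and pools each
    group with weights given by the softmax of token-query dot products,
    so the pooled vectors depend on the query.
    """
    group = math.ceil(len(chunk) / k)
    facts = []
    for start in range(0, len(chunk), group):
        part = chunk[start:start + group]
        weights = softmax([sum(a * b for a, b in zip(v, query)) for v in part])
        facts.append(
            [sum(w * v[d] for w, v in zip(weights, part)) for d in range(len(query))]
        )
    return facts


def project(rows, weight):
    """Linear projection: each row (length d_in) times weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_out)]
        for r in rows
    ]


def build_reasoner_inputs(doc, query, chunk_size, k, proj):
    """Chunk the document, compress each chunk, project to the reasoner width."""
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    facts = [f for c in chunks for f in compress_chunk(c, query, k)]
    return project(facts, proj)


# Assumed toy sizes: knowledge-model width 8, reasoning-model width 16.
rng = random.Random(0)
D_K, D_R = 8, 16
doc = [[rng.gauss(0, 1) for _ in range(D_K)] for _ in range(1000)]
query = [rng.gauss(0, 1) for _ in range(D_K)]
proj = [[rng.gauss(0, 0.1) for _ in range(D_R)] for _ in range(D_K)]

fact_embeddings = build_reasoner_inputs(doc, query, chunk_size=100, k=4, proj=proj)
print(len(doc), "->", len(fact_embeddings), "vectors of width", len(fact_embeddings[0]))
# prints: 1000 -> 40 vectors of width 16
```

In this sketch the reasoning model would receive 40 projected vectors in place of 1000 raw token embeddings; in the actual system, the compressor and the projection are learned rather than fixed.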
