CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He; Richard He Bai; Sinead Williamson; Jeff Z. Pan; Navdeep Jaitly; Yizhe Zhang

TL;DR

CLaRa is a unified retrieval-generation framework that replaces the traditional, disjoint RAG pipeline with end-to-end optimization in a shared latent space over embedded document representations. Central to the approach is SCP, which uses guided data synthesis (QA and paraphrase signals) to train a salient compressor that produces compact, semantically rich embeddings. CLaRa then jointly trains a query reasoner and a generator over these fixed embeddings with a differentiable top-k mechanism, so gradients flow through retrieval without explicit relevance labels. A theoretical analysis of gradient coupling shows that the two modules receive dual learning signals that stabilize joint training. Empirical results on four QA benchmarks demonstrate state-of-the-art compression and retrieval performance, often surpassing text-based baselines while requiring significantly less context. The work also highlights practical benefits such as memory-token representations and LoRA-based adapters, which make for scalable, efficient RAG with strong generalization potential in real-world QA and reasoning tasks.

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
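
To make the idea of selecting documents in a shared continuous space more concrete, the sketch below scores compressed document vectors against a query vector and picks the top k through a sequence of temperature-scaled softmaxes. It is only an illustration: the abstract does not specify which differentiable top-k estimator CLaRa uses, and the names `cosine`, `masked_softmax`, `soft_top_k`, and the temperature `tau` are hypothetical. In an autodiff framework, the soft weights returned here are what would carry gradients (for example, via a straight-through combination of hard and soft selections); this pure-Python version shows the forward pass only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_softmax(scores, mask, tau):
    """Softmax over the entries where mask is True; masked entries get 0."""
    top = max(s / tau for s, m in zip(scores, mask) if m)
    exps = [math.exp(s / tau - top) if m else 0.0
            for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def soft_top_k(query, docs, k, tau=0.1):
    """Pick k documents one at a time (requires k <= len(docs)).

    Returns (hard, soft): `hard` is the list of selected indices, and
    `soft` holds one probability row per selection step. The soft rows
    are the relaxed selections a gradient-based trainer would use.
    """
    scores = [cosine(query, d) for d in docs]
    mask = [True] * len(docs)
    hard, soft = [], []
    for _ in range(k):
        probs = masked_softmax(scores, mask, tau)
        best = max(range(len(docs)), key=lambda j: probs[j])
        hard.append(best)
        soft.append(probs)
        mask[best] = False  # exclude the chosen document from later steps
    return hard, soft


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    hard, soft = soft_top_k(query, docs, k=2)
    print(hard)
```

A lower `tau` pushes each soft row toward a one-hot vector, which narrows the gap between the relaxed and hard selections at the cost of sharper, higher-variance gradients.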
