CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang

TL;DR

CLaRa presents a unified retrieval-generation framework that replaces traditional, disjoint RAG pipelines with end-to-end optimization in a shared latent space over embedded document representations. Central to the approach is SCP (Salient Compressor Pretraining), which uses guided data synthesis (QA and paraphrase signals) to train a salient compressor that produces compact, semantically rich embeddings. CLaRa then jointly trains a query reasoner and a generator over these fixed embeddings with a differentiable top-k mechanism, so gradients flow through both modules without explicit relevance labels. A theoretical gradient-coupling analysis identifies dual learning signals that stabilize joint training, and empirical results on four QA benchmarks show state-of-the-art compression and retrieval performance, often surpassing text-based baselines while using significantly shorter contexts. The work also highlights practical design choices such as memory-token representations and LoRA-based adapters, which support scalable, efficient RAG for QA and reasoning tasks.

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
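
The abstract does not say which differentiable top-k estimator CLaRa uses, so the sketch below is only an illustration of the general idea, not the authors' method. It assumes a generic successive-softmax relaxation: k soft "picks" are made in turn, and each pick down-weights the items that earlier picks already claimed. The names `soft_top_k` and `finite_diff_grad`, the temperature values, and the example scores are all hypothetical. Because the selection weights vary smoothly with the retrieval scores, a downstream loss can pass gradients back to whatever produced those scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_top_k(scores, k, temperature=0.1):
    """Relaxed top-k selection weights (assumed relaxation, not CLaRa's).

    Makes k successive softmax picks. After each pick, an item's score is
    reduced by log(1 - p), so items that were already (softly) selected
    are unlikely to be picked again. The returned weights sum to k.
    """
    remaining = list(scores)
    weights = [0.0] * len(scores)
    for _ in range(k):
        probs = softmax([r / temperature for r in remaining])
        weights = [w + p for w, p in zip(weights, probs)]
        # Clamp avoids log(0) when a pick is almost one-hot.
        remaining = [r + math.log(max(1.0 - p, 1e-9))
                     for r, p in zip(remaining, probs)]
    return weights


def finite_diff_grad(f, x, eps=1e-5):
    """Central-difference gradient of scalar f at x (stands in for autograd)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        hi[i] += eps
        lo = list(x)
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


if __name__ == "__main__":
    scores = [2.0, 0.5, 1.5, -1.0]  # hypothetical query-document scores
    print(soft_top_k(scores, k=2, temperature=0.1))
    # Sensitivity of document 1's selection weight to every score.
    print(finite_diff_grad(lambda s: soft_top_k(s, 2, 1.0)[1], scores))
```

At a low temperature the weights concentrate on the k highest-scoring documents, approximating hard top-k. At a higher temperature the weights spread out and the gradients with respect to the scores become larger, which is the usual trade-off when tuning such a relaxation.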

Paper Structure

This paper contains 83 sections, 29 equations, 17 figures, 21 tables, 1 algorithm.

Figures (17)

  • Figure 1: (a) During training, we first pretrain the compressor to encourage it to retain only essential information. Next, we perform offline compression of the documents. After that, we encode the query using the query reasoner, retrieve the compressed document representations for generation, and use only the final next-token prediction loss to jointly update both the query reasoner and the generator. (b) An example from the inference stage: the tokens represent key clue words related to the question. When we decode the continuous query embedding, we find that it contains information not present in the original query, indicating that it has learned some of the intermediate reasoning keywords.
  • Figure 2: Overview of the SCP (Salient Compressor Pretraining) framework. It includes (a) synthetic data construction for pretraining and (b) compressor training using the pretraining data.
  • Figure 3: CLaRa end-to-end training: the query reasoner ($\theta_{qr}$) and generator ($\theta_g$) are updated via a language modeling loss using candidate document--question--answer triples.
  • Figure 4: Retrieval performance (Recall@1/3/5) of the Mistral-7B model across different reranking methods at a compression ratio of 4 and under various initialization settings on the NQ and HotpotQA datasets. Sup- denotes models whose reranker is trained with labeled data using contrastive learning. -Pretrain denotes experiments that use the model checkpoint obtained after pretraining, while -Instruct denotes experiments that use the model checkpoint obtained after instruction tuning.
  • Figure 5: Validation loss curves during the compression pretraining stage under different compression ratios (CR) on the Phi-4-mini (left) and Mistral-7B (right) models.
  • ...and 12 more figures