Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective

Yunhao Liu, Zian Jia, Xinyu Gao, Kanjun Xu, Yun Xiong

TL;DR

SeleCom is a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector. It significantly outperforms existing soft compression approaches and achieves performance competitive with or superior to non-compression baselines.

Abstract

Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet these methods often underperform non-compressed RAG because they rely on auto-encoder-like full-compression, which forces the encoder to compress all document information regardless of its relevance to the input query. In this work, we analyze this paradigm and reveal two fundamental limitations: (I) Infeasibility: full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector. The selector is decoder-only and is trained on a massive, diverse, and difficulty-graded synthetic QA dataset using curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves performance competitive with or superior to non-compression baselines, while reducing computation and latency by 33.8% to 84.6%.

Paper Structure (53 sections, 9 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Comparison between full-compression and query-conditioned selection. Full-document compression causes information loss and overloads the generator, leading it to neglect instructions. The query-conditioned selector extracts only the necessary information, improving performance and query awareness.
  • Figure 2: Illustration of instruction (non-)following behaviors under full-compression and non-compression settings.
  • Figure 3: Attention heatmap showing how the baseline RAG (blue) and the full-compression encoder (yellow) focus on the input document tokens.
  • Figure 4: Pipeline overview of SeleCom. Left: the overall workflow—retrieval of top-$k$ documents, query-conditioned selection and compression into $p$ embeddings, projection, and generation of the final answer. Right: the core selector mechanism, where autoregressive generation aggregates information into special-token embeddings with a dedicated, trainable embedding layer.
  • Figure 5: Illustration of the data construction process for the selection training data (Stage 1).
  • ...and 4 more figures