Context Embeddings for Efficient Answer Generation in RAG

David Rau; Shuai Wang; Hervé Déjean; Stéphane Clinchant

TL;DR

COCOM is an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. It supports different compression rates, trading off decoding time for answer quality.

Abstract

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to $5.69\times$ while achieving higher performance than existing efficient context compression methods.
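To make the compression-rate trade-off concrete, the following is a minimal sketch of the arithmetic only, not the COCOM model itself. It assumes that a context of $n$ tokens compressed at rate $\xi$ yields $\lceil n / \xi \rceil$ Context Embeddings and that each retrieved context is compressed independently; neither assumption is stated in the abstract. The helper name `num_context_embeddings` and the passage sizes are hypothetical.

```python
import math


def num_context_embeddings(context_lengths, compression_rate):
    """Hypothetical helper: total embeddings fed to the decoder.

    Assumes each context of n tokens is compressed on its own into
    ceil(n / compression_rate) embeddings (an assumption, see lead-in).
    """
    if compression_rate < 1:
        raise ValueError("compression_rate must be >= 1")
    return sum(math.ceil(n / compression_rate) for n in context_lengths)


# Five retrieved passages of 128 tokens each (made-up sizes for illustration).
contexts = [128] * 5
uncompressed = sum(contexts)

for rate in (4, 16, 128):
    total = num_context_embeddings(contexts, rate)
    print(f"rate={rate}: {uncompressed} tokens -> {total} embeddings")
```

Under these assumptions, a higher compression rate gives the decoder a shorter input, which is the source of the decoding speed-up, at the cost of packing more information into each embedding.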

Paper Structure

This paper contains 39 sections, 9 equations, 3 figures, and 12 tables.

Introduction
Related Work
Lexical-based Compression.
Embedding-based Compression.
Overview
Methodology
Task Definition: RAG
COCOM: Effective Context Compression
Adaptable Compression Rate
Multiple Contexts
Pre-training Context Embeddings
Auto-encoding with Context Embeddings.
Language Modeling from Context Embeddings.
Fine-tuning
Experimental Setup
...and 24 more sections

Figures (3)

Figure 1: COCOM: Compressing multiple contexts for RAG into a small set ($\xi \in \{4, 16, 128\}$) of Context Embeddings leads to a massive speed-up in answer generation while maintaining higher performance compared to other methods. Results are shown for the ASQA dataset.
Figure 2: Overview of our COCOM(-light) model pipeline.
Figure 3: Impact on zero-shot transferability of fine-tuning on multiple datasets (multi) concurrently vs. on a single dataset for COCOM. Compression rate $\xi = 128$.
