PISCO: Pretty Simple Compression for Retrieval-Augmented Generation

Maxime Louis, Hervé Déjean, Stéphane Clinchant

TL;DR

PISCO tackles the scalability challenge of Retrieval-Augmented Generation (RAG) with a soft document compressor based on memory tokens, trained by sequence-level knowledge distillation (SKD) on document-based questions. It achieves a 16x compression rate with only 0–3% accuracy loss across diverse RAG-QA tasks, requires no pretraining or annotated data, and allows a 7–10B LLM to be fine-tuned in about 48 hours on a single A100. Empirical results show up to a 5.7x inference speed-up and an 8% accuracy advantage over prior soft compression methods, with strong generalization to in-domain, out-of-domain, and multilingual tasks. The compressor and the decoder are fine-tuned end to end with LoRA adapters; under SKD, a teacher generates full answers from the original documents, and those answers supervise generation from the compressed representations. Overall, PISCO provides a scalable, drop-in compression solution for RAG that reduces computational cost while preserving QA performance.
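The distillation objective described above can be illustrated with a small sketch. It assumes the usual reading of sequence-level knowledge distillation: the student, which sees only compressed documents, is trained with token-level cross-entropy on the answer the teacher generated from the original documents. The function names, the toy vocabulary of three tokens, and the logits are hypothetical and are not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def skd_loss(student_logits, teacher_answer_ids):
    """Mean negative log-likelihood of the teacher's answer tokens.

    student_logits: one list of vocabulary logits per answer position,
        produced by the student conditioned on compressed documents
        (hypothetical input; no real model is called here).
    teacher_answer_ids: token ids of the answer the teacher generated
        from the uncompressed documents.
    """
    if len(student_logits) != len(teacher_answer_ids):
        raise ValueError("one logit vector is needed per answer token")
    total = 0.0
    for logits, token_id in zip(student_logits, teacher_answer_ids):
        total -= log_softmax(logits)[token_id]
    return total / len(teacher_answer_ids)

# Toy example: a 3-token vocabulary and a 2-token teacher answer.
teacher_answer = [1, 2]
uninformed = skd_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], teacher_answer)
aligned = skd_loss([[0.0, 4.0, 0.0], [0.0, 0.0, 4.0]], teacher_answer)
```

In this toy setting the loss for the uniform student equals log(3), and it drops once the student's logits favor the teacher's tokens. In an actual training run the gradient of such a loss would flow into the LoRA adapters of both the compressor and the decoder.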

Abstract

Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 48 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
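To make the 16x compression rate concrete, the sketch below counts how many positions the generator has to process with and without compression. The document length (128 tokens), the number of retrieved documents (5), and the question length (32 tokens) are illustrative assumptions chosen for the arithmetic; only the 16x rate comes from the text above.

```python
import math

def memory_tokens(doc_tokens: int, compression_rate: int = 16) -> int:
    """Number of memory embeddings that replace one document."""
    return math.ceil(doc_tokens / compression_rate)

def context_length(num_docs, doc_tokens, question_tokens, compression_rate=None):
    """Positions the generator must attend to for one query.

    With compression_rate=None the documents are passed as raw tokens;
    otherwise each document is replaced by its memory embeddings.
    The question is never compressed.
    """
    if compression_rate is None:
        per_doc = doc_tokens
    else:
        per_doc = memory_tokens(doc_tokens, compression_rate)
    return question_tokens + num_docs * per_doc

# Illustrative numbers (assumptions, not values reported in this section).
full = context_length(num_docs=5, doc_tokens=128, question_tokens=32)
compressed = context_length(num_docs=5, doc_tokens=128, question_tokens=32,
                            compression_rate=16)
print(full, compressed)
```

Under these assumptions the context shrinks from 672 to 72 positions. The overall reduction is smaller than 16x because the question itself stays uncompressed, which is one reason an end-to-end speed-up can be lower than the nominal compression rate.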

Paper Structure

This paper contains 34 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 2: PISCO substantially outperforms existing context compression methods for question answering with RAG. Shown here with Mistral-7B backbone.
  • Figure 3: Overview of PISCO training, shown here with $k=2$ documents. Training is supervised by distillation from a teacher model. Once trained, the full collection of documents can be compressed once to allow fast inference.
  • Figure 4: Pairwise comparison with GPT-4o shows that PISCO, utilizing the Mistral-7B backbone, outperforms COCOM across all datasets. It performs comparably to Mistral-7B while achieving a 16x compression rate.
  • Figure 5: Performance on pretraining tasks versus performance on RAG-QA. Correlations are very small, indicating that pretraining brings little benefit to the downstream QA task.
  • Figure 6: Impact of the number of fine-tuning samples on performance, with and without pretraining. Pretraining improves QA performance only when the number of fine-tuning samples is small.
  • ...and 7 more figures