Reranking with Compressed Document Representation

Hervé Déjean, Stéphane Clinchant

TL;DR

This work tackles the efficiency bottleneck of LLM-based document reranking by introducing RRK, a reranker that operates on fixed-size document embeddings produced by a frozen PISCO compressor. A fine-tuned Mistral-7B decoder scores query-document pairs. It is trained on MS MARCO by distillation from a strong teacher (SPLADE-V3 + Naver-DeBERTa), with an objective based on mean-squared error and Sequence-level Knowledge Distillation. RRK confines the input to a small fixed length (32 tokens) using 8 memory tokens per document, enabling substantial speed-ups (up to 16×) while preserving effectiveness on standard IR benchmarks, especially for long documents. The approach demonstrates the practicality of compressed representations for retrieval-augmented systems and offers a scalable path for integrating LLM-based rerankers with traditional retrievers. Limitations include dependence on query length and compression length; future work explores smaller models and improved compression for even longer content.
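The mean-squared-error part of the distillation objective can be illustrated with a minimal sketch. The function name `mse_distillation_loss` and the example scores are hypothetical and not taken from the paper; the sketch only shows the general idea of pushing a student reranker's scores toward a teacher's scores for the same query-document pairs, and it leaves out the Sequence-level Knowledge Distillation component.

```python
from typing import Sequence


def mse_distillation_loss(
    student_scores: Sequence[float],
    teacher_scores: Sequence[float],
) -> float:
    """Mean-squared error between student and teacher relevance scores.

    Each position corresponds to one query-document pair. This is an
    illustrative helper, not the authors' training code.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("student and teacher must score the same pairs")
    if len(student_scores) == 0:
        raise ValueError("at least one query-document pair is required")
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_scores, teacher_scores)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three candidate documents of one query.
teacher = [2.0, 0.5, -1.0]   # e.g. produced by the teacher reranker
student = [1.0, 0.5, 0.0]    # e.g. produced by the compressed-input reranker
loss = mse_distillation_loss(student, teacher)
print(f"MSE distillation loss: {loss:.4f}")
```

In an actual training loop this quantity would be computed on model outputs and minimized with gradient descent; the plain-Python version here is only meant to make the objective concrete.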

Abstract

Reranking, the process of refining the output of a first-stage retriever, is often considered computationally expensive, especially with Large Language Models. Borrowing from recent advances in document compression for RAG, we reduce the input size by compressing documents into fixed-size embedding representations. We then teach a reranker to use compressed inputs by distillation. Although based on a billion-size model, our trained reranker using this compressed input can challenge smaller rerankers in terms of both effectiveness and efficiency, especially for long documents. Given that text compressors are still in their early development stages, we view this approach as promising.

Paper Structure

This paper contains 10 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 2: Reranking processing time according to the input length. Using compressed representation (RRK models) enables our reranker to maintain constant efficiency regardless of document length.
  • Figure 3: PISCO architecture. The compression process uses a language model with LoRA adapters and appends memory tokens to each document to form embeddings; the number of memory tokens controls the compression rate. Decoding involves fine-tuning the decoder to adapt generation to compressed representations based on queries. The distillation objective employs Sequence-level Knowledge Distillation (SKD) to ensure models give consistent answers whether inputs are compressed or not.
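
The constant-cost behaviour described in the Figure 2 caption can be sketched by counting reranker input tokens. The sketch below is a simplified illustration, not the paper's measurement: it takes the 8 memory tokens per document from the summary, assumes a hypothetical query length of 24 tokens, and ignores prompt or special tokens as well as the cost of running the compressor.

```python
MEMORY_TOKENS_PER_DOC = 8   # stated in the summary
QUERY_TOKENS = 24           # hypothetical query length, an assumption


def full_text_input_len(query_tokens: int, doc_tokens: int) -> int:
    """Input length when the reranker reads the raw document text."""
    return query_tokens + doc_tokens


def compressed_input_len(
    query_tokens: int,
    memory_tokens: int = MEMORY_TOKENS_PER_DOC,
) -> int:
    """Input length when the document is replaced by its memory tokens."""
    return query_tokens + memory_tokens


# Hypothetical document lengths, chosen only to show the trend.
for doc_len in (128, 512, 2048):
    full = full_text_input_len(QUERY_TOKENS, doc_len)
    compressed = compressed_input_len(QUERY_TOKENS)
    print(f"doc={doc_len:5d} tokens  full-text input={full:5d}  "
          f"compressed input={compressed:3d}")
```

Under these assumptions the compressed input stays the same size for every document length, while the full-text input grows with the document, which is the effect the figure attributes to the RRK models.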