The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs

Mert Yazan, Suzan Verberne, Frederik Situmeang

TL;DR

The study investigates whether post-training INT4 quantization preserves retrieval-augmented generation (RAG) performance in small LLMs (7B–8B) for long-context personalized tasks. It evaluates multiple models with AWQ quantization against FP16 baselines on LaMP-3U and LaMP-5U, varying the number of retrieved documents and using three retrievers. Findings show quantization effects are highly model- and task-dependent: some models (notably OpenChat) tolerate INT4 with minimal loss, while others (e.g., LLaMA2) are more sensitive as retrieval load increases. The results demonstrate that quantized 7B LLMs can serve effectively as RAG backbones, enabling more affordable deployment, and highlight the need for exploring additional quantization methods and retrieval configurations in future work.

Abstract

Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities. Since LLM abilities emerge with scale, smaller LLMs are more sensitive to quantization. In this paper, we explore how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG), specifically in longer contexts. We chose personalization for evaluation because it is a challenging domain to perform using RAG as it requires long-context reasoning over multiple documents. We compare the original FP16 and the quantized INT4 performance of multiple 7B and 8B LLMs on two tasks while progressively increasing the number of retrieved documents to test how quantized models fare against longer contexts. To better understand the effect of retrieval, we evaluate three retrieval models in our experiments. Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance and long-context reasoning capabilities. We conclude that it is possible to utilize RAG with quantized smaller LLMs.
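As a rough illustration of why INT4 quantization lowers the cost of deployment, the weight storage of a model can be estimated from its parameter count and the number of bits per weight. The sketch below is a back-of-the-envelope calculation, not something taken from the paper: the function name `weight_memory_gib` is hypothetical, the 7B parameter count is a round figure, and the estimate covers weights only. It ignores activations, the KV cache, and the per-group scales and zero-points that group-wise schemes such as AWQ store alongside the INT4 weights.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB.

    Hypothetical helper for illustration; it ignores quantization metadata
    (scales, zero-points), activations, and the KV cache.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed round parameter count for a "7B" model.
NUM_PARAMS = 7e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"INT4 weights: {int4_gib:.1f} GiB")
print(f"Reduction:    {fp16_gib / int4_gib:.0f}x")
```

Under these assumptions the weights shrink by a factor of four, which is the ratio of 16 bits to 4 bits; the real saving is somewhat smaller once quantization metadata is included.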

Paper Structure (14 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Prompts used for both datasets. The ones on the top represent $k=0$ (zero-shot, no retrieved documents) and the ones on the bottom are for $k>0$ settings (RAG). The green text is the model output. Line endings are not shown for space reasons.
  • Figure 2: Results for both datasets. The upper and lower borders of each colored area show the quantized and unquantized performance of a model, and the corresponding line is the mean of the two.
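
The band-and-mean presentation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' plotting code: the function name `band_and_mean`, the values of $k$, and all scores are made-up placeholders and are not results from the paper.

```python
def band_and_mean(fp16_scores, int4_scores):
    """For each setting, return (lower border, upper border, mean).

    The two borders are the unquantized (FP16) and quantized (INT4) scores,
    ordered so the lower value comes first; the mean is their average.
    Hypothetical helper for illustration only.
    """
    rows = []
    for fp16, int4 in zip(fp16_scores, int4_scores):
        rows.append((min(fp16, int4), max(fp16, int4), (fp16 + int4) / 2))
    return rows


# Placeholder numbers, NOT results from the paper.
ks = [0, 1, 3]                   # assumed numbers of retrieved documents
fp16_scores = [0.50, 0.75, 1.00]
int4_scores = [0.25, 0.75, 1.50]

for k, (low, high, mean) in zip(ks, band_and_mean(fp16_scores, int4_scores)):
    print(f"k={k}: band=[{low}, {high}], mean={mean}, width={high - low}")
```

A narrow band at a given $k$ means the quantized and unquantized models score almost the same there, while a band that widens as $k$ grows indicates a model whose quantized version degrades with longer contexts.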