4bit-Quantization in Vector-Embedding for RAG

Taehee Jeong

TL;DR

This paper tackles the memory bottleneck of retrieval-augmented generation by applying 4-bit and 8-bit quantization to high-dimensional embedding vectors. It investigates BF16, INT8, and INT4 quantization, introducing a symmetric quantization pipeline and exploring group-wise INT4 to balance memory savings with retrieval accuracy. Empirical results show that 8-bit quantization largely preserves retrieval performance, while 4-bit quantization degrades accuracy unless mitigated by group-wise schemes; compared to Product Quantization, INT4-based methods offer stronger retrieval fidelity. The findings suggest substantial memory reductions (potentially enabling larger vector databases) with practical implications for deploying RAG in memory-constrained environments, albeit with hardware and software limitations to be addressed.
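
The symmetric, group-wise INT4 idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the integer range [-7, 7], and the per-group max-abs scale are assumptions made for illustration; the group size of 32 is taken from the Figure 4 caption.

```python
import numpy as np

def quantize_int4_groupwise(x, group_size=32):
    """Symmetric INT4 quantization with one float scale per group (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float32)
    assert x.size % group_size == 0, "vector length must be a multiple of group_size"
    groups = x.reshape(-1, group_size)
    # Symmetric scheme: map [-max_abs, +max_abs] of each group onto integers [-7, 7].
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4_groupwise(q, scale, shape):
    """Approximate reconstruction of the original float vector."""
    return (q.astype(np.float32) * scale).reshape(shape)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-in for an embedding vector (assumed dimension: 1024).
rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int4_groupwise(vec, group_size=32)
vec_hat = dequantize_int4_groupwise(q, scale, vec.shape)
similarity = cosine(vec, vec_hat)
```

Because each group carries its own scale, a single outlier only coarsens the resolution of its own 32 values rather than the whole vector, which is the intuition behind group-wise schemes recovering accuracy that plain INT4 loses.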

Abstract

Retrieval-augmented generation (RAG) is a promising technique for addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: their knowledge can be outdated because it is fixed by their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucination. RAG mitigates these issues by drawing on a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. High-dimensional embeddings, however, require a significant amount of memory, which becomes a major issue for large document databases. To alleviate this problem, we propose storing the embedding vectors with 4-bit quantization, reducing their precision from 32-bit floating-point numbers to 4-bit integers. Our approach has two benefits. First, it significantly reduces the memory required by the high-dimensional vector database, making it more feasible to deploy RAG systems in resource-constrained environments. Second, it speeds up search, since lower-precision vectors allow faster computation. Our code is available at https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
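
To make the memory argument concrete, the following sketch packs two signed 4-bit values into each byte and compares per-vector storage against 32-bit floats. The helper names and the embedding dimension of 1024 are hypothetical, chosen only for illustration, and any per-group scale factors would add a small overhead that is not counted here.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit integers (range [-8, 7]) two per byte (hypothetical helper)."""
    q = np.asarray(q, dtype=np.int8).ravel()
    assert q.size % 2 == 0, "need an even number of values"
    nibbles = (q & 0x0F).astype(np.uint8)  # low 4 bits of the two's-complement value
    return (nibbles[0::2] | (nibbles[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values as int8."""
    packed = np.asarray(packed, dtype=np.uint8).ravel()
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    out[out > 7] -= 16  # sign-extend: nibbles 8..15 encode -8..-1
    return out

dim = 1024  # assumed embedding dimension
rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=dim).astype(np.int8)  # stand-in for quantized values

packed = pack_int4(q)
fp32_bytes = dim * 4          # 32-bit floats: 4 bytes per dimension
int4_bytes = packed.nbytes    # 4-bit integers: half a byte per dimension
ratio = fp32_bytes / int4_bytes
```

Under these assumptions a 1024-dimensional vector shrinks from 4096 bytes to 512 bytes, an 8x reduction before scale-factor overhead.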

Paper Structure

This paper contains 11 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: RAG system
  • Figure 2: Distribution of embedding dimensions
  • Figure 3: Data types
  • Figure 4: Distribution of cosine similarity for all pairs of 1000 vectors. (d) INT4 with group size 32
  • Figure 5: RMSE of the cosine similarity values of quantized pairs
  • ...and 2 more figures