Optimization of Retrieval-Augmented Generation Context with Outlier Detection

Vitaly Bulgakov

TL;DR

This work tackles the problem of oversized and noisy prompt contexts in Retrieval-Augmented Generation by filtering out semantically irrelevant documents. It develops a distance-based feature pipeline that computes the distance of each retrieved embedding to the centroid, $d_{centroid}$, and to the query, $d_{query}$, builds several alternative feature vectors from these distances, standardizes them, and optionally reduces dimensionality with PCA before applying a Gaussian Mixture Model for outlier detection. The study finds that the interaction-style feature representation yields the strongest improvements, especially for more complex questions, and that a higher min_outlier_freq setting improves similarity to ground-truth answers at the cost of greater computation. Overall, the approach reduces context length and improves answer quality in RAG systems, and it is reported to be robust across smaller and larger models and on datasets such as SQuAD2.0.

Abstract

In this paper, we focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Increasing the number of retrieved document chunks to enlarge the query-related context can significantly complicate processing and degrade the performance of a Large Language Model (LLM) when it generates responses. A large set of documents retrieved from a database in response to a query may contain irrelevant information, which often leads to hallucinations in the resulting answers. Our goal is to select the most semantically relevant documents and treat the discarded ones as outliers. We propose and evaluate several methods for identifying outliers by creating features from the distances of the embedding vectors retrieved from the vector database to both the centroid and the query vector. The methods were evaluated by comparing the similarity of the resulting LLM responses to ground-truth answers obtained using the OpenAI GPT-4o model. The greatest improvements were achieved as the complexity of the questions and answers increased.

Paper Structure (5 sections, 12 equations, 8 figures)

This paper contains 5 sections, 12 equations, 8 figures.

Distance Calculation
Feature Creation Methods
Standardization
Dimensionality Reduction
Outlier Detection
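
The five steps listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `filter_context`, the `contamination` and `n_pca` parameters, the use of Euclidean distance, the choice of two mixture components, the low-likelihood thresholding rule, and the reading of "interaction-style" features as the two distances plus their product are all hypothetical choices made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def filter_context(doc_embeddings, query_embedding, contamination=0.2, n_pca=2, seed=0):
    """Return a boolean mask that is True for documents kept in the context."""
    docs = np.asarray(doc_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)

    # Distance calculation: distance of each document to the centroid and to the query.
    # Euclidean distance is an assumption; the source does not fix the metric here.
    centroid = docs.mean(axis=0)
    d_centroid = np.linalg.norm(docs - centroid, axis=1)
    d_query = np.linalg.norm(docs - query, axis=1)

    # Feature creation: assumed interaction-style form (both distances and their product).
    features = np.column_stack([d_centroid, d_query, d_centroid * d_query])

    # Standardization, followed by optional dimensionality reduction with PCA.
    z = StandardScaler().fit_transform(features)
    if n_pca is not None:
        z = PCA(n_components=n_pca).fit_transform(z)

    # Outlier detection: fit a Gaussian Mixture Model and flag the documents with
    # the lowest log-likelihood as outliers (assumed thresholding rule).
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(z)
    log_lik = gmm.score_samples(z)
    threshold = np.quantile(log_lik, contamination)
    return log_lik > threshold
```

Calling `filter_context(doc_embeddings, query_embedding)` on the embeddings returned by the vector database yields a mask that can be used to drop the flagged chunks before the prompt is assembled. The `contamination` value directly controls how many documents are discarded, so in practice it would need tuning per dataset.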

Figures (8)

Figure 1: Summary of conducted experiments
Figure 2: Clusters and outliers with 2 principal components of the feature vector
Figure 3: Similarity improvement with increasing question complexity
Figure 4: Average similarity change as the number of processed questions increases (Part 1)
Figure 5: Average similarity change as the number of processed questions increases (Part 2)
...and 3 more figures
