Context-Driven Index Trimming: A Data Quality Perspective to Enhancing Precision of RALMs
Kexin Ma, Ruochun Jin, Xi Wang, Huan Chen, Jing Ren, Yuhua Tang
TL;DR
This work tackles data quality issues in Retrieval-Augmented LLMs (RALMs) by introducing Context Matching Dependencies (CMDs), a principled rule-based framework that enforces consistency between retrieved contexts and their semantic meaning. The Context-Driven Index Trimming (CDIT) framework leverages CMDs and LLMs to prune inconsistent retrievals and to trim vector database indices, thereby improving downstream answer accuracy. Empirical results across diverse datasets and models show average improvements of roughly 3–5%, with peaks up to 15.21%, across multiple indexing schemes and even when CDIT is integrated with Self-RAG. The authors note limitations in handling long texts and a reliance on GPT-based judgments, and they encourage future work on CMD mining and on offline alternatives that reduce dependence on online LLMs.
Abstract
Retrieval-Augmented Large Language Models (RALMs) have made significant strides in enhancing the accuracy of generated responses. However, existing research often overlooks data quality issues in retrieval results, which commonly stem from the inaccuracy of vector-distance-based retrieval methods. We propose to boost the precision of RALMs' answers from a data quality perspective through the Context-Driven Index Trimming (CDIT) framework, where Context Matching Dependencies (CMDs) are employed as logical data quality rules to capture and regulate the consistency between retrieved contexts. Based on the semantic comprehension capabilities of Large Language Models (LLMs), CDIT can effectively identify and discard retrieval results that are inconsistent with the query context and further modify indexes in the database, thereby improving answer quality. Experiments on challenging question-answering tasks demonstrate the effectiveness of CDIT. In addition, the flexibility of CDIT is verified through its compatibility with various language models and indexing methods, which offers a promising approach to jointly bolster RALMs' data quality and retrieval precision.
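The filter-then-trim idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_and_trim`, the `Judge` signature, and the toy index (a dictionary mapping a query key to a ranked list of passage ids) are all assumptions made for illustration, and the judge is a plain callable standing in for the LLM-based consistency check.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: returns True when a passage is consistent
# with the query context. In CDIT this role is played by an LLM guided by
# CMDs; here it is any callable, so the sketch runs offline.
Judge = Callable[[str, str], bool]


def filter_and_trim(
    index: Dict[str, List[str]],   # assumed toy index: query key -> ranked passage ids
    passages: Dict[str, str],      # passage id -> passage text
    query: str,
    k: int,
    judge: Judge,
) -> List[str]:
    """Return the top-k passage ids judged consistent, trimming the rest."""
    ranked = index.get(query, [])
    candidates = ranked[:k]

    # Step 1: discard retrieved passages the judge finds inconsistent.
    kept = [pid for pid in candidates if judge(query, passages[pid])]
    rejected = set(candidates) - set(kept)

    # Step 2: trim the index so rejected passages are not retrieved again
    # for this query key. Passages beyond the top-k are left untouched.
    if rejected:
        index[query] = [pid for pid in ranked if pid not in rejected]
    return kept
```

In this sketch only the candidate list for the given query key is modified, so a rejected passage remains available to other queries; whether the actual framework trims at that granularity is not stated in this section.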
