Cognitive-Aligned Document Selection for Retrieval-augmented Generation

Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang

TL;DR

This work presents GGatrieval, a retrieval-augmented generation framework that enforces a cognitive-inspired alignment criterion to select verifiable documents. Through Fine-grained Grounded Alignment and Semantic Compensation Query Update, it iteratively refines queries and documents to achieve high alignment between query components and retrieved segments. The approach demonstrates state-of-the-art performance on the ALCE benchmark and significant gains in verifiability metrics across diverse datasets, while also reducing retrieval load compared to prior iterative methods. Limitations include computational overhead and reliance on semantic relations without explicit modeling of logical component interactions; future work aims to address these through reinforcement learning and deeper analysis of alignment dynamics.

Abstract

Large language models (LLMs) are prone to hallucination because parametric knowledge alone cannot guarantee the accuracy of generated text. Although retrieval-augmented generation (RAG) systems improve the accuracy and reliability of generative models by incorporating external documents, in practice the retrieved documents often fail to adequately support the model's responses. To address this issue, we propose GGatrieval (Fine-Grained Grounded Alignment Retrieval for verifiable generation), which leverages an LLM to dynamically update queries and filter high-quality, reliable retrieval documents. Specifically, we parse the user query into its syntactic components and perform fine-grained grounded alignment with the retrieved documents. For query components that cannot be individually aligned, we propose a dynamic semantic compensation mechanism that iteratively refines and rewrites the query while continuously updating the retrieval results. This iterative process continues until the retrieved documents sufficiently support the query's response. Our approach introduces a novel criterion for filtering retrieved documents that closely emulates human strategies for acquiring targeted information, ensuring that the retrieved content effectively supports and verifies the generated outputs. On the ALCE benchmark, our method significantly surpasses a wide range of baselines, achieving state-of-the-art performance.
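The iterative procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the corpus, the function names (`toy_retrieve`, `toy_align`, `toy_rewrite`, `iterative_select`), the substring-matching alignment that stands in for the LLM, and the stopping rule (stop once every component is grounded, or after `max_rounds`) are all assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical toy corpus, used only to make the sketch runnable.
CORPUS = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie was born in Warsaw.",
    "The Eiffel Tower is in Paris.",
]


def toy_retrieve(query: str) -> List[str]:
    """Assumed stand-in retriever: return documents sharing a word with the query."""
    words = set(query.lower().split())
    return [d for d in CORPUS if words & set(d.lower().rstrip(".").split())]


def toy_align(components: List[str], document: str) -> Set[str]:
    """Assumed stand-in for the LLM alignment step: a component counts as
    grounded when it appears verbatim (case-insensitively) in the document."""
    text = document.lower()
    return {c for c in components if c.lower() in text}


def toy_rewrite(query: str, missing: List[str]) -> str:
    """Assumed stand-in for semantic compensation: append unaligned components."""
    return query + " " + " ".join(missing)


def iterative_select(
    query: str,
    components: List[str],
    retrieve: Callable[[str], List[str]],
    align: Callable[[List[str], str], Set[str]],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[List[str], Set[str]]:
    """Retrieve, align, and rewrite until every component is grounded."""
    selected: List[str] = []
    covered: Set[str] = set()
    for _ in range(max_rounds):
        for doc in retrieve(query):
            hits = align(components, doc)
            if hits and doc not in selected:
                selected.append(doc)  # keep only documents that ground something
                covered |= hits
        missing = [c for c in components if c not in covered]
        if not missing:
            break  # assumed stopping rule: all components are grounded
        query = rewrite(query, missing)
    return selected, covered


docs, covered = iterative_select(
    "Nobel", ["Marie Curie", "Nobel Prize", "Warsaw"],
    toy_retrieve, toy_align, toy_rewrite,
)
```

In this toy run the first round grounds two of the three components, the rewritten query pulls in the document that grounds the third, and the unrelated document is never selected. A real system would replace each stand-in with a retriever and LLM calls.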

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example of our criterion. For each syntactic component of the query, a corresponding grounded segment that aligns with it can be identified in the retrieved document.
  • Figure 2: Overview of GGatrieval. Our approach (Section 3.4) leverages large language models for the Fine-grained Grounded Alignment strategy (Section 3.2) to obtain the document labels Full Alignment, Partial Alignment, and No Alignment (Section 3.1), while implementing the Dynamic Semantic Compensation strategy (Section 3.3) for query updates to enhance the retrieval of highly aligned documents.
  • Figure 3: Impact of interactions among components. The solid line represents the evaluation metric, while the dashed line indicates the distribution of Full Alignment labels within the sample.
  • Figure 4: Cross-dataset analysis of label proportions. The bars above the dashed line represent system performance, while the bars below the dashed line indicate the proportion of documents with different labels.
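
The three document labels named in the Figure 2 caption, and the label proportions plotted in Figure 4, can be sketched as follows. The labeling rule is an assumption, not the paper's definition from Section 3.1: here a document is Full Alignment when every query component is grounded, Partial Alignment when only some are, and No Alignment when none are. The function names `alignment_label` and `label_proportions` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def alignment_label(components: List[str], aligned: Set[str]) -> str:
    """Assign a label from the set of components a document grounds (assumed rule)."""
    matched = [c for c in components if c in aligned]
    if components and len(matched) == len(components):
        return "Full Alignment"
    if matched:
        return "Partial Alignment"
    return "No Alignment"


def label_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of documents carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


components = ["Marie Curie", "Nobel Prize", "Warsaw"]
labels = [
    alignment_label(components, {"Marie Curie", "Nobel Prize", "Warsaw"}),
    alignment_label(components, {"Marie Curie"}),
    alignment_label(components, set()),
    alignment_label(components, {"Warsaw"}),
]
proportions = label_proportions(labels)
```

With these four hypothetical documents, half are labeled Partial Alignment and a quarter each fall under the other two labels.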