Agentic Verification for Ambiguous Query Disambiguation

Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He

TL;DR

This work tackles ambiguous queries in retrieval-augmented generation by introducing VerDICt, a unified framework that couples diversification with verification through agentic feedback from both the retriever and the generator. By grounding interpretations in the corpus during diversification and consolidating noisy feedback via clustering, VerDICt reduces cascading errors and improves efficiency, achieving substantial grounded performance gains on ASQA. Empirically, it delivers higher grounded precision and recall (G-precision up to 93% and G-recall around 57%) while maintaining diversity across model sizes, and it demonstrates production viability in Snowflake’s Cortex Agents API. The approach advances enterprise RAG by ensuring verifiable, context-grounded disambiguation, with open-sourced code and a verifiability evaluation framework for future research.

Abstract

In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) into diverse yet answerable interpretations. State-of-the-art approaches follow a Diversify-then-Verify (DtV) pipeline, in which an LLM generates diverse interpretations that are later used as search queries to retrieve supporting passages. Such a process may introduce noise in either the interpretations or the retrieval, particularly in enterprise settings, where LLMs, trained on static data, may struggle with domain-specific disambiguations. A post-hoc verification phase is therefore introduced to prune the noise. Our distinction is to unify diversification with verification by incorporating feedback from the retriever and the generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark, where it achieves diverse yet verifiable interpretations. Empirical results show that VERDICT improves the grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.
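
To make the consolidation idea concrete (grouping noisy, near-duplicate interpretation-answer candidates and keeping one representative per group), the following is a minimal sketch and not the authors' implementation. The greedy cosine-threshold clustering, the function name `consolidate`, the threshold value, and the toy 2-D embeddings are all assumptions made for illustration; this section does not specify the clustering algorithm or the embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consolidate(candidates, threshold=0.9):
    """Cluster (text, embedding) candidates and return one medoid text per cluster.

    Hypothetical helper: greedy clustering against each cluster's first member,
    with an assumed similarity threshold.
    """
    clusters = []  # each cluster is a list of indices into `candidates`
    for i, (_, emb) in enumerate(candidates):
        for cluster in clusters:
            seed_emb = candidates[cluster[0]][1]
            if cosine(emb, seed_emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no cluster was similar enough: start a new one

    medoids = []
    for cluster in clusters:
        # Medoid: the member with the highest total similarity to its cluster.
        best = max(
            cluster,
            key=lambda i: sum(
                cosine(candidates[i][1], candidates[j][1]) for j in cluster
            ),
        )
        medoids.append(candidates[best][0])
    return medoids


# Toy candidates: interpretation and answer concatenated, with made-up embeddings.
candidates = [
    ("Who won the men's World Cup in 2018? France", [1.0, 0.0]),
    ("Which country won the 2018 men's World Cup? France", [0.9, 0.1]),
    ("Who won the 2018 FIFA World Cup final? France", [0.8, 0.3]),
    ("Who won the Women's World Cup in 2019? United States", [0.0, 1.0]),
]
medoids = consolidate(candidates)
```

Here the three near-duplicate readings collapse into one cluster, represented by its most central member, while the distinct reading stays separate. In a real system the embeddings would come from an encoder over the concatenated interpretation and answer, and the threshold would need tuning.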

Paper Structure

This paper contains 36 sections, 22 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison of (a) DtV (Diversify-then-Verify) and (b) VerDICt (Verified-Diversification with Consolidation, ours).
  • Figure 2: Illustration of the full pipeline of VerDICt: verified diversification (Section \ref{subsec:extraction}) followed by the consolidation phase (Section \ref{subsec:clustering}). On the right, yellow and blue dots represent embeddings of generated interpretations and their answers, embedded together after concatenation; yellow denotes the medoids, that is, the representatives chosen from each cluster.
  • Figure 3: Analysis of the accuracy of the $\hat{q}$'s and $\hat{y}$'s generated by VerDICt. (i, answer correctness) Models easily derive correct $\hat{q}, \hat{y}$ given an answerable passage. (ii, interpretation error rate) Model scale matters more for discerning unanswerable passages.
  • Figure 4: Query clarification in Snowflake’s Cortex Agents API setup with tool access to a series of synthetically generated insurance documents retrieved via Cortex Search services.
  • Figure 5: Comparison of the end-to-end workflows of VerDICt and DtV (DIVA; In et al., 2024) for the ambiguous question answering task. Vertical arrangement denotes sequential dependency, while calls that can run in parallel are placed at the same horizontal level.
  • ...and 4 more figures