Agentic Verification for Ambiguous Query Disambiguation
Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He
TL;DR
This work tackles ambiguous queries in retrieval-augmented generation (RAG) by introducing VERDICT, a unified framework that couples diversification with verification through agentic feedback from both the retriever and the generator. By grounding interpretations in the corpus during diversification and consolidating noisy feedback via clustering, VERDICT reduces cascading errors and improves efficiency, achieving substantial gains in grounded performance on ASQA. Empirically, it delivers higher grounded precision and recall (G-precision up to 93% and G-recall around 57%) while maintaining diversity across model sizes, and it demonstrates production viability in Snowflake's Cortex Agents API. The approach advances enterprise RAG by ensuring verifiable, context-grounded disambiguation, and the code and a verifiability evaluation framework are open-sourced for future research.
Abstract
In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) into diverse yet answerable interpretations. State-of-the-art approaches follow a Diversify-then-Verify (DtV) pipeline, in which an LLM generates diverse interpretations that are then used as search queries to retrieve supporting passages. Such a process may introduce noise in either the interpretations or the retrieval, particularly in enterprise settings, where LLMs, trained on static data, may struggle with domain-specific disambiguations. A post-hoc verification phase is therefore needed to prune this noise. Our distinction is to unify diversification with verification by incorporating feedback from the retriever and the generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark, where it produces diverse yet verifiable interpretations. Empirical results show that VERDICT improves the grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.
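To make the contrast with DtV concrete, the following is a minimal sketch of the idea of verifying while diversifying, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`retrieve`, `toy_generator`, `verified_diversify`), the token-overlap retriever, the rule-based stand-in for the LLM generator, the toy corpus, and the choice to consolidate by grouping identical answers (the abstract does not specify the clustering procedure). Interpretations are produced only from retrieved passages, the generator signals when a passage cannot answer the query, and the surviving candidates are consolidated into clusters.

```python
import re

# Toy data for illustration only; not drawn from ASQA.
QUERY = "Who directed The Lion King?"
CORPUS = [
    "The 1994 film The Lion King was directed by Roger Allers and Rob Minkoff.",
    "The 2019 film The Lion King was directed by Jon Favreau.",
    "The 2019 remake of The Lion King was directed by Jon Favreau.",
    "Lions live in prides in the African savanna.",
]


def tokens(text):
    """Lowercased word set, used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=4):
    """Hypothetical retriever: rank passages by token overlap with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(p)), i) for i, p in enumerate(corpus)]
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in ranked[:k]]


def toy_generator(query, passage):
    """Rule-based stand-in for an LLM call (assumption).

    Returns (interpretation, answer) if the passage supports an answer,
    or None as negative feedback. The query is unused in this toy version.
    """
    marker = " was directed by "
    if marker not in passage:
        return None
    subject, answer = passage.rstrip(".").split(marker, 1)
    subject = subject[0].lower() + subject[1:]
    return f"Who directed {subject}?", answer


def verified_diversify(query, corpus, generator, k=4):
    """Ground interpretations in retrieved passages, then consolidate.

    Consolidation here groups candidates with the same normalized answer,
    a deliberately simple substitute for a real clustering step.
    """
    clusters = {}
    for pid in retrieve(query, corpus, k):
        feedback = generator(query, corpus[pid])
        if feedback is None:  # generator feedback: passage cannot answer
            continue
        interpretation, answer = feedback
        cluster = clusters.setdefault(
            answer.strip().lower(),
            {"interpretation": interpretation, "answer": answer, "support": []},
        )
        cluster["support"].append(pid)
    return list(clusters.values())


if __name__ == "__main__":
    for c in verified_diversify(QUERY, CORPUS, toy_generator):
        print(c["interpretation"], "->", c["answer"], "| passages", c["support"])
```

In this sketch every returned interpretation carries at least one supporting passage by construction, so no separate post-hoc pruning pass is needed. The off-topic passage is retrieved but rejected by the generator stub, and the two passages about the 2019 film collapse into a single interpretation.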
