VISTA Score: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White

TL;DR

VISTA Score reframes factuality in multi-turn dialogue as a dynamic, turn-based process. It decomposes each assistant turn into atomic claims, verifies them against evolving background knowledge and current references, and categorizes unverifiable content into four types, enabling sequential consistency tracking. Across four dialogue benchmarks (FaithDial, BEGIN, FADE, AIS) and eight models, VISTA achieves higher hallucination-detection accuracy than FACTSCORE and LLM-as-Judge, with statistically significant gains ($p<0.05$) and pronounced improvements for open-weight models; human evaluation confirms better annotator agreement and uncovers benchmark inconsistencies. By providing a modular, interpretable framework and releasing a 140-conversation dataset, VISTA offers a human-aligned, scalable path to more trustworthy dialogue systems.

Abstract

Hallucination, defined here as generating statements unsupported or contradicted by available evidence or conversational context, remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and sorts unverifiable statements into four categories (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FaithDial, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
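
The turn-by-turn loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `assess_dialogue`, `toy_extract`, and `toy_verify` are hypothetical, the extractor and verifier are string-matching stand-ins for what would presumably be LLM calls, and the rule for flagging a turn (any contradicted or evidence-lacking claim) is an assumption made only for illustration. The five labels follow the abstract: verified, plus the four unverifiable categories.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Label(Enum):
    VERIFIED = "verified"
    SUBJECTIVE = "subjective"
    CONTRADICTED = "contradicted"
    LACKING_EVIDENCE = "lacking_evidence"
    ABSTAINING = "abstaining"


@dataclass
class TurnResult:
    claims: List[Tuple[str, Label]] = field(default_factory=list)

    @property
    def has_hallucination(self) -> bool:
        # Assumption (not from the paper): flag a turn if any claim is
        # contradicted or lacks evidence.
        bad = (Label.CONTRADICTED, Label.LACKING_EVIDENCE)
        return any(label in bad for _, label in self.claims)


def assess_dialogue(
    turns: List[Tuple[str, str]],
    extract_claims: Callable[[str], List[str]],
    verify_claim: Callable[[str, str, List[str]], Label],
) -> List[TurnResult]:
    """Process (assistant_response, reference_text) pairs in order."""
    background: List[str] = []  # verified claims carried across turns
    results: List[TurnResult] = []
    for response, reference in turns:
        result = TurnResult()
        newly_verified: List[str] = []
        for claim in extract_claims(response):
            label = verify_claim(claim, reference, background)
            result.claims.append((claim, label))
            if label is Label.VERIFIED:
                newly_verified.append(claim)
        # Later turns can rely on facts verified in earlier turns.
        background.extend(newly_verified)
        results.append(result)
    return results


def toy_extract(response: str) -> List[str]:
    # Hypothetical stand-in: treat each sentence as one atomic claim.
    return [s.strip() for s in response.split(".") if s.strip()]


def toy_verify(claim: str, reference: str, background: List[str]) -> Label:
    # Hypothetical stand-in: exact string matching; never returns CONTRADICTED.
    text = claim.lower()
    if text.startswith("i don't know"):
        return Label.ABSTAINING
    if text.startswith("i think"):
        return Label.SUBJECTIVE
    if text in reference.lower() or any(text == b.lower() for b in background):
        return Label.VERIFIED
    return Label.LACKING_EVIDENCE


turns = [
    ("Paris is the capital of France.",
     "Paris is the capital of France. It lies on the Seine."),
    ("Paris is the capital of France. Paris has 40 million residents.", ""),
]
results = assess_dialogue(turns, toy_extract, toy_verify)
for i, r in enumerate(results, start=1):
    print(i, [(c, l.value) for c, l in r.claims], r.has_hallucination)
```

In the second turn no reference is supplied, yet the first claim is still verified because it was added to the background facts after turn one; only the unsupported population claim is flagged. Passing the extractor and verifier as callables keeps the sequential bookkeeping separate from whatever model performs each step.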

Paper Structure

This paper contains 28 sections, 3 figures, 18 tables.

Figures (3)

  • Figure 1: Overview of the VISTA Score pipeline. Each assistant turn undergoes claim extraction, verification, and categorization, with accumulated background facts informing subsequent turns.
  • Figure 2: Accuracy of models on hallucination detection on the 140 conversations (227 turns) selected for human evaluation. The darker color bars represent accuracy according to the original datasets' labels. The lighter color bars represent accuracy according to the human evaluators' consensus labels in this study.