AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment

Ahmad Aghaebrahimian

TL;DR

This paper tackles factual hallucination in LLM outputs by introducing AlignCheck, a schema-free, fact-based framework that decomposes text into atomic facts and uses a weighted F1-like metric augmented with BERTScore to measure factual overlap. It enables interpretability through FP and FN diagnostics and a TF-IDF weighting scheme over entity types, and it validates the approach on general and clinical datasets (AgreeFact and MIMIC-IV-Ext-BHC) with various fine-tuning strategies. The results demonstrate significant differences among summarization models in terms of factuality, supporting AlignCheck as a diagnostic and potentially training-friendly objective for fact-aware generation. The work advances practical evaluation of factual consistency and provides a pathway to integrate factual checks into model training and deployment in high-stakes domains.

Abstract

Large Language Models have significantly advanced natural language processing, but they remain prone to generating plausible yet incorrect or misleading statements. This issue, known as hallucination, is particularly concerning in high-stakes domains such as clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, which makes errors difficult to diagnose and mitigate. To address these limitations, we propose an interpretable framework for assessing factual consistency in both in-domain and open-domain texts. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods, which rely on an absolute metric, our approach incorporates a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
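To make the idea of a weighted, F1-like score over atomic facts more concrete, the following is a minimal sketch, not the paper's implementation. It assumes facts have already been extracted and tagged with an entity type, it uses case-insensitive exact string matching as a stand-in for the BERTScore-based semantic matching, and it takes the per-type weights as given rather than deriving them with TF-IDF. The names `Fact`, `weighted_f1`, and the example weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """An atomic fact with an entity type (hypothetical structure)."""
    text: str
    entity_type: str


def weighted_f1(reference, candidate, type_weights):
    """Weighted precision, recall, and F1 over atomic facts.

    Assumption: facts match on case-insensitive exact text. The paper
    uses semantic matching, which this sketch does not reproduce.
    """
    ref_texts = {f.text.lower() for f in reference}
    cand_texts = {f.text.lower() for f in candidate}

    def weight(fact):
        # Entity types without an explicit weight default to 1.0.
        return type_weights.get(fact.entity_type, 1.0)

    # Candidate facts supported by the reference, and unsupported ones (FP).
    tp_cand = [f for f in candidate if f.text.lower() in ref_texts]
    false_pos = [f for f in candidate if f.text.lower() not in ref_texts]
    # Reference facts recovered by the candidate, and missed ones (FN).
    tp_ref = [f for f in reference if f.text.lower() in cand_texts]
    false_neg = [f for f in reference if f.text.lower() not in cand_texts]

    tp_cand_w = sum(weight(f) for f in tp_cand)
    tp_ref_w = sum(weight(f) for f in tp_ref)
    fp_w = sum(weight(f) for f in false_pos)
    fn_w = sum(weight(f) for f in false_neg)

    precision = tp_cand_w / (tp_cand_w + fp_w) if tp_cand_w + fp_w else 0.0
    recall = tp_ref_w / (tp_ref_w + fn_w) if tp_ref_w + fn_w else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": [f.text for f in false_pos],
        "false_negatives": [f.text for f in false_neg],
    }


# Illustrative data only; the weights are made up, not TF-IDF values.
reference = [
    Fact("patient has diabetes", "DISEASE"),
    Fact("patient takes metformin", "DRUG"),
    Fact("admitted on monday", "DATE"),
]
candidate = [
    Fact("patient has diabetes", "DISEASE"),
    Fact("patient takes insulin", "DRUG"),
]
weights = {"DISEASE": 2.0, "DRUG": 2.0, "DATE": 1.0}

result = weighted_f1(reference, candidate, weights)
print(result)
```

The returned false-positive and false-negative lists correspond to the FP and FN diagnostics mentioned in the TL;DR: they show which facts were hallucinated and which were omitted, rather than reporting only a single number.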

Paper Structure

This paper contains 5 sections, 2 tables, and 1 algorithm.