MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze

TL;DR

MedScore introduces a domain-adapted decompose-then-verify factuality evaluation pipeline tailored for medical free-form answers. It defines a MedScore Taxonomy to guide claim decomposition and uses a modular verification stage against internal knowledge, doctor responses, and an external medical corpus via MedRAG. The authors validate on AskDocsAI (medical), PUMA (out-of-domain medical), and CaLMQA (non-medical), showing higher valid-claim coverage and lower invalid-claim rates than baselines, while revealing sensitivity to decomposition prompts and verifier choices. They also demonstrate cross-domain generalizability with minimal prompt changes, arguing for the pipeline's adaptability to other fields. Limitations include partial coverage of completeness, potential verifier bias, and limited publicly available doctor-produced datasets.

Abstract

While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing them into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically evaluated only on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, structurally diverse, and subjective, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline that decomposes medical answers into condition-aware valid facts and verifies them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references while retaining condition-dependency in facts. The resulting factuality score varies substantially with the decomposition method, the verification corpus, and the backbone LLM, which highlights the importance of customizing each step for reliable factuality evaluation; our generalizable, modular pipeline supports this kind of domain adaptation.

Paper Structure

This paper contains 34 sections, 3 figures, 18 tables.

Figures (3)

  • Figure 1: The decompose-then-verify pipeline for factuality evaluation on AskDocsAI, using MedScore condition-aware decomposition and medical corpus verification. The decomposition step breaks down sentences into one or more "atomic facts". The verification step checks the factuality of each fact given a context. The context shown in the figure consists of medical passages retrieved from an external medical corpus. The full AskDocsAI data example used here is in \ref{tab:AskDocsAI_example}.
  • Figure 2: Number of extracted claims per chatbot response (left) and sentence (right) from FActScore, MedScore, and VeriScore decomposition methods for AskDocsAI.
  • Figure 3: Number of extracted claims per answer (left) and sentence (right) from FActScore, MedScore, and VeriScoreQA decomposition methods for PUMA.
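
The decompose-then-verify flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_decompose` and `toy_verify` are hypothetical stand-ins for the LLM-based decomposition and verification steps, and scoring a response as the fraction of supported claims is a common convention for this family of metrics rather than a formula stated in this summary. It is not the authors' implementation.

```python
import re
from typing import Callable, List

# Hypothetical type aliases for the two pluggable pipeline stages.
Decomposer = Callable[[str], List[str]]   # sentence -> atomic claims
Verifier = Callable[[str, str], bool]     # (claim, context) -> supported?


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_decompose(sentence: str) -> List[str]:
    """Toy stand-in for LLM decomposition: split a sentence on ' and '."""
    parts = sentence.rstrip(".!?").split(" and ")
    return [p.strip() for p in parts if p.strip()]


def toy_verify(claim: str, context: str) -> bool:
    """Toy stand-in for LLM verification: case-insensitive substring match."""
    return claim.lower() in context.lower()


def factuality_score(
    response: str,
    context: str,
    decompose: Decomposer = toy_decompose,
    verify: Verifier = toy_verify,
) -> float:
    """Fraction of extracted claims that the verifier marks as supported.

    Returns 0.0 when no claims are extracted (an assumption of this sketch).
    """
    claims = [c for s in split_sentences(response) for c in decompose(s)]
    if not claims:
        return 0.0
    supported = sum(1 for c in claims if verify(c, context))
    return supported / len(claims)


if __name__ == "__main__":
    response = "Ibuprofen reduces fever and ibuprofen cures diabetes."
    context = "Ibuprofen reduces fever."
    print(factuality_score(response, context))
```

Because the decomposer and verifier are passed in as plain callables, either stage can be swapped independently, which mirrors the modular design the abstract describes.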