MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification
Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
TL;DR
MedScore introduces a domain-adapted decompose-then-verify factuality evaluation pipeline tailored for medical free-form answers. It defines a MedScore Taxonomy to guide claim decomposition and uses a modular verification stage against internal knowledge, doctor responses, and an external medical corpus via MedRAG. The authors validate on AskDocsAI (medical), PUMA (out-of-domain medical), and CaLMQA (non-medical), showing higher valid-claim coverage and lower invalid-claim rates than baselines, while revealing sensitivity to decomposition prompts and verifier choices. They also demonstrate cross-domain generalizability with minimal prompt changes, arguing for the pipeline's adaptability to other fields. Limitations include partial coverage of completeness, potential verifier bias, and limited publicly available doctor-produced datasets.
Abstract
While Large Language Models (LLMs) can generate fluent and convincing responses, those responses are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing them into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically evaluated only on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, subjective, and diverse in sentence structure, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline that decomposes medical answers into condition-aware valid facts and verifies them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references and retaining condition-dependency in facts. The resulting factuality score varies substantially with the decomposition method, verification corpus, and backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation; our generalizable, modular pipeline supports this kind of domain adaptation.
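The decompose-then-verify structure described above can be illustrated with a minimal Python sketch. It is not the MedScore implementation: the function names, the toy sentence-splitting decomposer, the exact-match verifier, and the scoring rule (fraction of claims judged supported) are all assumptions made for illustration. In a real pipeline, the `decompose` and `verify` callables would wrap an LLM prompt and a retrieval-backed verifier, respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ClaimResult:
    claim: str
    supported: bool


def factuality_score(
    answer: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str], bool],
) -> Tuple[float, List[ClaimResult]]:
    """Decompose an answer into claims, verify each, and return the supported fraction.

    Both stages are pluggable, mirroring the modular design the abstract describes.
    The scoring rule here (supported claims / all claims) is an assumption.
    """
    results = [ClaimResult(c, verify(c)) for c in decompose(answer)]
    if not results:
        return 0.0, results  # no claims extracted: score defined as 0.0 in this sketch
    return sum(r.supported for r in results) / len(results), results


# Hypothetical stand-ins for the two stages (toy logic, not LLM-based).
def toy_decompose(answer: str) -> List[str]:
    # Treat each sentence as one claim; a real decomposer would keep conditions attached.
    return [s.strip() for s in answer.split(".") if s.strip()]


TOY_CORPUS = {
    "ibuprofen can irritate the stomach",
    "take ibuprofen with food if you have reflux",
}


def toy_verify(claim: str) -> bool:
    # Exact match against a tiny in-memory "corpus"; a placeholder for real verification.
    return claim.lower() in TOY_CORPUS


answer = (
    "Ibuprofen can irritate the stomach. "
    "Take ibuprofen with food if you have reflux. "
    "Ibuprofen cures reflux."
)
score, results = factuality_score(answer, toy_decompose, toy_verify)
print(f"{score:.2f}")
for r in results:
    print(r.supported, "-", r.claim)
```

Because the two stages are passed in as callables, swapping the decomposition prompt or the verification corpus changes only the arguments, which makes it easy to observe how the score shifts with each choice.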
