Ensuring Ground Truth Accuracy in Healthcare with the EVINCE framework

Edward Y. Chang

TL;DR

The paper addresses misdiagnosis and the propagation of mislabeled ground-truth data into ML-driven clinical workflows. It introduces EVINCE, an entropy-based framework in which multiple LLMs engage in structured, contentious debates, guided by information-duality principles that balance exploration and convergence. Core contributions include the IDEA theory for optimal LLM pairing (one high-entropy and one low-entropy model with equal information quality) and Algorithmic Robust Aggregation (ARA), which minimizes online regret and stabilizes predictions. Empirical studies, covering a Dengue vs. Chikungunya debate and ground-truth robustness and remediation, demonstrate modest to notable gains in diagnostic accuracy and reveal how debate-driven uncertainty can surface ground-truth inconsistencies for remediation. Collectively, EVINCE offers a practical pathway to improve diagnostic precision and to audit and refine historical medical labels, with potential impact on patient safety and trust in AI-augmented healthcare.
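To make the "high-entropy versus low-entropy" pairing concrete, the following is a minimal sketch that computes the Shannon entropy of two predicted distributions over candidate diagnoses. It is an illustration only, not the paper's implementation: the function name `shannon_entropy`, the two example distributions, and the "Other" category are hypothetical and chosen for this sketch.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical predicted distributions over candidate diagnoses.
# These numbers are made up for illustration; they are not from the paper.
exploratory = {"Dengue": 0.40, "Chikungunya": 0.35, "Other": 0.25}
confident = {"Dengue": 0.85, "Chikungunya": 0.10, "Other": 0.05}

h_exploratory = shannon_entropy(exploratory.values())
h_confident = shannon_entropy(confident.values())

# The flatter distribution carries more uncertainty (higher entropy),
# the peaked one less.
print(f"exploratory: {h_exploratory:.3f} bits")
print(f"confident:   {h_confident:.3f} bits")
```

Under these assumed numbers, the flatter distribution has the higher entropy, which is the sense in which one debater can be described as exploratory and the other as convergent.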

Abstract

Misdiagnosis is a significant issue in healthcare, leading to harmful consequences for patients. The propagation of mislabeled data through machine learning models into clinical practice is unacceptable. This paper proposes EVINCE, a system designed to 1) improve diagnosis accuracy and 2) rectify misdiagnoses and minimize training data errors. EVINCE stands for Entropy Variation through Information Duality with Equal Competence; it leverages this novel theory to optimize the diagnostic process using multiple Large Language Models (LLMs) in a structured debate framework. Our empirical study verifies that EVINCE is effective in achieving its design goals.

Paper Structure (30 sections, 11 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Contentious Debate
Optimal Conditioning
Related Work
EVINCE Algorithm
Improving Diagnosis Accuracy
Optimal Pairing of LLMs
Theory IDEA: Optimal Pairing of LLMs for Probabilistic Diagnostic Accuracy.
Algorithmic Robust Aggregation (ARA)
Modeling Nature and the Aggregator
Modeling Regret
Empirical Study
Debate: Dengue Fever vs. Chikungunya
Moderator's Prompt
GPT-4's Opening Round
...and 15 more sections

Figures (4)

Figure 1: Pre-/post-debate accuracy shows EVINCE helps
Figure 2: Confusion matrices. Claude3 provides more alternatives (or bridges) for GPT4 to reconsider.
Figure 3: Robust aggregation reaches stable aggregated prediction
Figure 4: Remediation: Jaundice to Hepatitis
