Med-CoDE: Medical Critique based Disagreement Evaluation Framework

Mohit Gupta, Akiko Aizawa, Rajiv Ratn Shah

TL;DR

This work tackles the challenge of robustly evaluating medical LLMs, where inaccuracies can have clinical consequences. It introduces Med-CoDE, a critique-based disagreement framework that quantifies discrepancies between model outputs and medical ground truths. The approach includes a disease-grounded critique dataset, a critique-generation model (Phi-3) and a critique-classification model (BERT), producing four levels of disagreement (None, Low, Moderate, High). Experiments on four medical benchmarks across multiple LLMs show that Med-CoDE correlates with ground-truth accuracy, reveals model-specific disagreement patterns, and provides more nuanced evaluation than traditional string metrics. The work advances reliable evaluation for medical AI and can be adapted to other domains by swapping critique data.
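The two-stage flow described above (generate a critique, then classify it into one of four disagreement levels) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `disagreement_distribution`, `generate_critique`, and `classify_critique` are hypothetical names, and the two callables stand in for the fine-tuned Phi-3 generator and BERT classifier, whose actual interfaces are not given here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# The four disagreement levels named in the summary.
LEVELS = ("None", "Low", "Moderate", "High")

# (question, model_answer, ground_truth) triples; this layout is an assumption.
Example = Tuple[str, str, str]


def disagreement_distribution(
    examples: Iterable[Example],
    generate_critique: Callable[[str, str, str], str],  # hypothetical stand-in for the critique generator
    classify_critique: Callable[[str], str],            # hypothetical stand-in for the critique classifier
) -> Dict[str, float]:
    """Return the fraction of examples that fall into each disagreement level."""
    counts = Counter()
    for question, model_answer, ground_truth in examples:
        # Stage 1: critique the model answer against the ground truth.
        critique = generate_critique(question, model_answer, ground_truth)
        # Stage 2: map the critique to one of the four levels.
        label = classify_critique(critique)
        if label not in LEVELS:
            raise ValueError(f"unexpected disagreement label: {label!r}")
        counts[label] += 1
    total = sum(counts.values())
    # Report every level, including those never observed.
    return {level: (counts[level] / total if total else 0.0) for level in LEVELS}
```

Passing the two stages in as callables keeps the aggregation independent of any particular model, so toy functions can be substituted when checking the tallying logic.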

Abstract

The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the capabilities of automated systems to process and generate human-like text. Despite these advancements, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance, leading to potential risks in clinical settings. In this work, we propose Med-CoDE, an evaluation framework designed specifically for medical LLMs to address these challenges. The framework uses a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths, capturing both accuracy and reliability in medical settings. It aims to fill the existing gap in LLM assessment by offering a systematic method to evaluate the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.

Paper Structure

This paper contains 14 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Med-CoDE framework
  • Figure 2: The overall fine-tuning pipeline for the critique generator and classifier
  • Figure 3: Critique generation prompt template
  • Figure 4: Critique data samples with different disagreement labels