Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation

Shunfan Zheng; Xiechi Zhang; Gerard de Melo; Xiaoling Wang; Linlin Wang

TL;DR

HDCEval tackles misalignment in medical LLM evaluation with a Hierarchical Divide-and-Conquer framework guided by professional medical guidelines. It decomposes complex judgments into specialized subtasks, each handled by an expert model trained with Attribute-Driven Token Optimization (ADTO) on a carefully constructed preference dataset. Evaluation outputs are expressed as $E=\big\{E_1,\ldots,E_m\big\}$, where each $E_i=(s_i,p_i)$. Across a multi-source medical dataset, HDCEval significantly outperforms baselines and shows higher consistency with human evaluators (notably a 23.92% gain over PandaLM), while remaining robust to variations in input form and reducing model bias through its preference-data construction strategy. The approach yields finer-grained, rationale-supported assessments that better reflect clinical reasoning, suggesting practical impact for safer, more reliable medical AI evaluation in freestyle clinical contexts.

Abstract

In the rapidly evolving landscape of large language models (LLMs) for medical applications, ensuring the reliability and accuracy of these models in clinical settings is paramount. Existing benchmarks often focus on fixed-format tasks like multiple-choice QA, which fail to capture the complexity of real-world clinical diagnostics. Moreover, traditional evaluation metrics and LLM-based evaluators struggle with misalignment, often providing oversimplified assessments that do not adequately reflect human judgment. To address these challenges, we introduce HDCEval, a Hierarchical Divide-and-Conquer Evaluation framework tailored for fine-grained alignment in medical evaluation. HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors, encompassing Patient Question Relevance, Medical Knowledge Correctness, and Expression. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models trained through Attribute-Driven Token Optimization (ADTO) on a meticulously curated preference dataset. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
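To make the decomposition concrete, the following is a minimal sketch of how one sub-evaluation per guideline aspect could be collected into a single result $E=\{E_1,\ldots,E_m\}$ with $E_i=(s_i,p_i)$. It is an illustration, not the authors' implementation. The names `SubEvaluation` and `combine` are hypothetical. The reading of $s_i$ as a numeric score and $p_i$ as a textual rationale is an assumption that this summary does not confirm. The choice of $m=3$ simply mirrors the three aspects named in the abstract.

```python
from dataclasses import dataclass

# The three guideline aspects named in the abstract (m = 3 is an assumption).
ASPECTS = (
    "Patient Question Relevance",
    "Medical Knowledge Correctness",
    "Expression",
)


@dataclass(frozen=True)
class SubEvaluation:
    """Hypothetical container for one E_i = (s_i, p_i)."""
    aspect: str  # which guideline aspect this E_i covers
    s: float     # s_i, assumed here to be a numeric score
    p: str       # p_i, assumed here to be a textual rationale


def combine(sub_evals):
    """Collect per-aspect results into E, keyed by aspect name."""
    seen = {e.aspect for e in sub_evals}
    missing = [a for a in ASPECTS if a not in seen]
    if missing:
        raise ValueError(f"missing aspects: {missing}")
    return {e.aspect: (e.s, e.p) for e in sub_evals}


if __name__ == "__main__":
    E = combine([
        SubEvaluation("Patient Question Relevance", 4.0, "addresses the question"),
        SubEvaluation("Medical Knowledge Correctness", 3.0, "one dosage error"),
        SubEvaluation("Expression", 5.0, "clear and well organized"),
    ])
    for aspect, (s, p) in E.items():
        print(f"{aspect}: s={s}, p={p}")
```

Keying the result by aspect keeps each judgment separately inspectable, which matches the fine-grained, per-aspect assessment the abstract describes.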

Paper Structure (32 sections, 3 equations, 5 figures, 3 tables, 1 algorithm)

This paper contains 32 sections, 3 equations, 5 figures, 3 tables, 1 algorithm.

Introduction
Methodology
Task Formulation
Fine-grained Medical Evaluation Guidelines
Hierarchical Divide-and-Conquer Evaluation Framework
Overview
Hierarchical Divide
Preference Data Construction
Attribute-Driven Token Optimization
Experiments
Experimental Setup
Medical Dataset
Data Source
Dataset Construction
Dataset Validation
...and 17 more sections

Figures (5)

Figure 1: Fixed format task for evaluation.
Figure 3: Overview of the Hierarchical Divide-and-Conquer Evaluation Framework. "Hierarchical Divide" represents the Divide component, while "Preference Data Construction" and "Attribute-Driven Token Optimization" constitute the Conquer component.
Figure 4: The performance of MedAlpaca and ChatDoctor across multiple medical scenarios is evaluated using HDCEval and compared to human doctors' judgments. "Win" indicates the percentage of cases where a given medical language model outperforms the other, while "Tie" indicates the percentage of cases where both medical LLMs received the same score.
Figure 5: Preferences of human doctors between Our Method, GPT-4, and PandaLM.
Figure 6: Multi Evaluation Task (Win, Tie, Lose) of HDCEval and Humans.
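The "Win" and "Tie" percentages defined in the Figure 4 caption can be computed from paired per-case scores. The sketch below is hedged: the function name `win_tie_rates` and the assumption that each case yields one comparable numeric score per model are illustrative and not taken from the paper.

```python
def win_tie_rates(scores_a, scores_b):
    """Return win and tie percentages for two models scored on the same cases.

    scores_a[i] and scores_b[i] are the scores the two models received on
    case i (assumed to be directly comparable numbers).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not scores_a:
        raise ValueError("at least one case is required")

    n = len(scores_a)
    a_wins = sum(a > b for a, b in zip(scores_a, scores_b))
    b_wins = sum(a < b for a, b in zip(scores_a, scores_b))
    ties = n - a_wins - b_wins  # cases where both models got the same score

    return {
        "a_win": 100.0 * a_wins / n,
        "b_win": 100.0 * b_wins / n,
        "tie": 100.0 * ties / n,
    }


if __name__ == "__main__":
    # Four hypothetical cases: A wins two, B wins one, one tie.
    print(win_tie_rates([3, 4, 2, 5], [3, 2, 4, 1]))
```

Running the same function once on evaluator-assigned scores and once on doctor-assigned scores gives two distributions that can be compared side by side, which is the kind of comparison the figure captions describe.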
