Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives

Yuhao Shen, Jiahe Qian, Shuping Zhang, Zhangtianyi Chen, Tao Lu, Juexiao Zhou

TL;DR

This work tackles the critical need for trustworthy evaluation of dermatology diagnostic narratives generated by multimodal LLMs. It introduces DermBench, a benchmark pairing images with physician-certified references scored on six clinical dimensions, and DermEval, a reference-free multimodal evaluator trained to align with physician judgments via Score Oriented REINFORCE with an EMA baseline. Empirical results show close alignment with expert ratings (MAE of 0.251 for DermBench and 0.117 for DermEval, on a 5-point scale) and consistent model rankings across nine multimodal systems, highlighting complementary strengths and limitations across dimensions such as Accuracy, Safety, and Medical Groundedness. The proposed framework enables scalable, clinically grounded assessment of diagnostic narratives, facilitating safer deployment and iterative model improvement in dermatology. Limitations include limited dataset diversity and potential reference bias; future work will focus on broader data sources, calibration, and clinical workflow integration.

Abstract

Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of DermBench and DermEval. DermBench evaluates a candidate diagnostic narrative by comparing it to a physician-approved reference text for the same image, whereas DermEval evaluates directly from the given image and diagnostic text without requiring a reference.
  • Figure 2: Dataset construction. A dual-stream pipeline is used. The first stream produces clinician-verified, high-quality diagnostic narratives that become the certified references for DermBench. The second stream produces diagnostic narratives of varying quality. Image-text pairs from both streams are used to train the evaluator DermEval.
  • Figure 3: DermBench evaluation workflow. A candidate LLM generates a diagnostic narrative from a standardized prompt. An LLM judge compares the narrative with the clinician-certified reference and assigns six scores, namely Accuracy, Safety, Medical Groundedness, Clinical Coverage, Reasoning Coherence, and Description Precision.
  • Figure 4: DermEval training pipeline. The evaluator takes an image and a diagnostic text and generates a structured evaluation, from which an external LLM extracts six scores in the range from zero to five. Physician scores define a negative mean squared error reward, an exponential moving average baseline yields a low-variance advantage, and policy gradients are applied only to the generated segment.
  • Figure 5: Distribution of skin disease categories covered by our dataset. The pie chart illustrates the proportion of images in each of the 23 dermatological categories used for model training and evaluation.
  • ...and 1 more figure
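
The training signal described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names, the decay of 0.9, the zero-initialized baseline, and the choice to average the squared error over the six dimensions are all assumptions made for illustration. The loss is computed on plain floats, so it shows only the arithmetic of the surrogate objective, not an autograd-ready training step.

```python
from dataclasses import dataclass


def neg_mse_reward(predicted, physician):
    """Reward = negative mean squared error between the extracted scores
    and the physician scores (averaging over dimensions is an assumption)."""
    if len(predicted) != len(physician):
        raise ValueError("score vectors must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, physician)]
    return -sum(squared) / len(squared)


@dataclass
class EmaBaseline:
    """Exponential moving average of past rewards (hypothetical helper)."""
    decay: float = 0.9   # assumed value, not taken from the paper
    value: float = 0.0   # assumed zero initialization

    def advantage(self, reward):
        """Return reward minus the current baseline, then update the baseline."""
        adv = reward - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return adv


def masked_reinforce_loss(token_logprobs, generated_mask, advantage):
    """REINFORCE surrogate loss: -advantage * sum of log-probabilities,
    counting only tokens in the generated segment (mask is True there)."""
    if len(token_logprobs) != len(generated_mask):
        raise ValueError("log-probs and mask must have the same length")
    generated = sum(lp for lp, keep in zip(token_logprobs, generated_mask) if keep)
    return -advantage * generated


if __name__ == "__main__":
    # Six per-dimension scores in [0, 5]; the values are made up for illustration.
    extracted = [4.0, 5.0, 3.0, 4.0, 4.0, 5.0]
    physician = [4.0, 4.0, 3.0, 5.0, 4.0, 5.0]

    baseline = EmaBaseline()
    reward = neg_mse_reward(extracted, physician)
    adv = baseline.advantage(reward)

    # The first two tokens stand in for the prompt and are masked out.
    logprobs = [-0.1, -0.2, -0.5, -0.25]
    mask = [False, False, True, True]
    print(reward, adv, masked_reinforce_loss(logprobs, mask, adv))
```

Subtracting the moving-average baseline leaves the expected gradient unchanged while reducing its variance, which is the role the caption assigns to the EMA baseline. Masking restricts the update to tokens the evaluator generated, so the prompt tokens contribute nothing to the loss.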