Table of Contents
Fetching ...

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank

TL;DR

Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports, is introduced, which shows that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias.

Abstract

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

TL;DR

Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports, is introduced, which shows that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias.

Abstract

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
Paper Structure (9 sections, 4 equations, 3 figures, 2 tables)

This paper contains 9 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Clinical interpretation of CAD. Each point is a vocabulary term, mapped from its demographic association in the reference corpus ($z_{ref}$, x-axis) to its association in predictions ($z_{pred}$, y-axis). Points near the diagonal (gray band) indicate stable associations, while off-diagonal points highlight terms whose demographic association has shifted, reflecting erasure, new bias, or bias flips.
  • Figure 2: (a)Clinical performance comparison (w/ Template Diversity %). (b)Global lexical shift across inference strategies. (c)$\Delta$WAE successfully captures the semantic shifts caused by sex-skewed training.
  • Figure 3: Ablation: Sweeping $p_{\mathrm{disp}}^{\star}$ preserves relative model ranking: CAD detects more emergent stereotypes in stochastic outputs across the full range of significance levels.