LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation

Ruixiao Yang, Yuanhe Tian, Xu Yang, Huiqi Li, Yan Song

TL;DR

This work tackles hallucinations in radiology report generation by addressing the intrinsic decoding biases of large language models. It introduces Layer-wise Expert-aligned Decoding (LEAD), which injects multi-label visual pathology signals into every decoder layer through a visual expert module and a context-adaptive gated fusion mechanism, guided by confidence-weighted, layer-specific projections. The training objective combines next-token generation with supervised pathology classification, balanced by a coupling parameter $\lambda = 4$. LoRA is applied to the LLM, while the vision components are fully fine-tuned. Results on CheXpert Plus and MIMIC-CXR show that LEAD improves clinical accuracy and reduces hallucinations across model scales, and ablation studies highlight the importance of layer-wise projections and dynamic gating. Overall, the results suggest that internal, adaptive decoding guidance can outperform external knowledge interventions for faithful radiology report generation.
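
One way to read the combined objective is sketched below. The symbols $\mathcal{L}_{\text{gen}}$ (next-token generation loss) and $\mathcal{L}_{\text{cls}}$ (pathology classification loss) are names introduced here for illustration, and placing $\lambda$ on the classification term is an assumption: the summary states only that $\lambda = 4$ balances the two terms, not which one it scales.

$$\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{cls}}, \qquad \lambda = 4$$

Under this reading, a larger $\lambda$ would push the visual expert toward accurate pathology predictions at some cost to the weight given to report fluency.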

Abstract

Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostic reports from medical images. Although large vision-language models (LVLMs) improve report fluency and accuracy, they exhibit hallucinations, generating plausible pathological details that are not grounded in the image. Existing methods primarily rely on external knowledge guidance to align the generated text with visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases of pretrained models, and they lack robustness because they depend on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method that modifies the LVLM decoding trajectory from within. A multi-expert module extracts distinct pathological features, which are integrated into each decoder layer via a learned gating mechanism. This layer-wise architecture lets the LLM consult expert features at every inference step, dynamically rectifying decoding biases and steering generation toward factual consistency. Experiments on multiple public datasets demonstrate that LEAD improves clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.
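
The gated integration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scalar sigmoid gate, the residual-style addition, and every name here (`gated_fusion`, `fuse_layerwise`, `gate_weights`, `gate_bias`) are assumptions made for illustration, and the decoder layers themselves are omitted.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(hidden, expert, gate_weights, gate_bias):
    """Blend an expert feature vector into a hidden state.

    Assumed form (hypothetical, not taken from the paper):
        gate  = sigmoid(w . [hidden; expert] + b)
        fused = hidden + gate * expert
    """
    if len(hidden) != len(expert):
        raise ValueError("hidden and expert must have the same length")
    if len(gate_weights) != 2 * len(hidden):
        raise ValueError("gate_weights must cover [hidden; expert]")
    context = list(hidden) + list(expert)  # concatenate both inputs
    logit = sum(w * x for w, x in zip(gate_weights, context)) + gate_bias
    gate = sigmoid(logit)
    fused = [h + gate * e for h, e in zip(hidden, expert)]
    return fused, gate


def fuse_layerwise(hidden, expert_per_layer, gate_params_per_layer):
    """Apply gated fusion once per layer, each with its own expert
    features and gate parameters. The actual decoder layer computation
    that would sit between fusions is left out of this sketch."""
    gates = []
    for expert, (weights, bias) in zip(expert_per_layer, gate_params_per_layer):
        hidden, gate = gated_fusion(hidden, expert, weights, bias)
        gates.append(gate)
    return hidden, gates
```

In this sketch a gate near 0 leaves the hidden state almost unchanged, while a gate near 1 adds the full expert signal, which is one simple way a model could decide per layer and per step how much to rely on the visual experts.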

Paper Structure (20 sections, 6 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Conceptual illustration of Layer-wise Expert-aligned Decoding (LEAD). Unlike standard VLMs and existing methods that often succumb to intrinsic language priors (top and middle rows), our approach directly intervenes in the internal decoding process. By adaptively injecting fine-grained visual expert signals into intermediate representations of each decoder layer, LEAD dynamically rectifies the generation trajectory to ensure faithful alignment with fine-grained medical visual facts (bottom row).
  • Figure 2: An overview of the proposed framework.
  • Figure 3: Qualitative comparison of generated reports. "Backbone" represents the results of the fully unfrozen hybrid fine-tuned backbone. Green text indicates correct pathological findings, red highlights factual errors, and yellow denotes missed findings.