Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao
TL;DR
The paper tackles medical report generation (MRG) by identifying three core challenges: insufficient understanding of domain knowledge, poor text-visual alignment, and spurious cross-modal correlations. It introduces HTSC-CIF, a three-tier framework comprising a Domain Knowledge Enhancement Module, a Cross-Modal Alignment Module that combines Prefix Language Modeling and Masked Image Modeling, and a Cross-Modal Causal Intervention Module that performs visual and language deconfounding. The three modules are linked by a staged training procedure. Experiments on MIMIC-CXR and IU-Xray report state-of-the-art METEOR and ROUGE-L scores and strong gains on other metrics, and ablations highlight the contribution of each module. The authors argue that the approach improves the reliability and interpretability of MRG, with practical benefits for clinical reporting. Code is to be released upon acceptance.
Abstract
Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work addresses these challenges only one at a time, whereas this paper tackles all three through a novel hierarchical task decomposition approach, the HTSC-CIF framework. HTSC-CIF maps the three challenges to low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: apply a cross-modal causal intervention module (via front-door intervention) to mitigate confounding and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
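
For readers unfamiliar with the front-door intervention named in the high-level task, the sketch below illustrates the generic front-door adjustment formula, P(y | do(x)) = sum_m P(m | x) sum_x' P(y | x', m) P(x'), on a toy discrete model. It is not the paper's module, which operates on learned visual and language features and whose choice of mediator is not described here. The variables U (hidden confounder), X, M (mediator), Y and every probability value are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical toy model (all numbers are illustrative assumptions):
# U -> X and U -> Y (hidden confounder), X -> M -> Y (mediator path).
p_u = {0: 0.6, 1: 0.4}
p_x_given_u = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_x_given_u[u][x]
p_m_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_m_given_x[x][m]
p_y1_given_mu = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # P(Y=1 | m, u)


def p_y_given_mu(y, m, u):
    p1 = p_y1_given_mu[(m, u)]
    return p1 if y == 1 else 1.0 - p1


# Observational joint P(x, m, y): U is marginalised out, as if unobserved.
joint = {
    (x, m, y): sum(
        p_u[u] * p_x_given_u[u][x] * p_m_given_x[x][m] * p_y_given_mu(y, m, u)
        for u in (0, 1)
    )
    for x, m, y in product((0, 1), repeat=3)
}


def p_x(x):
    return sum(joint[(x, m, y)] for m in (0, 1) for y in (0, 1))


def p_m_given_x_obs(m, x):
    return sum(joint[(x, m, y)] for y in (0, 1)) / p_x(x)


def p_y_given_xm(y, x, m):
    return joint[(x, m, y)] / sum(joint[(x, m, yy)] for yy in (0, 1))


def front_door(y, x):
    """Front-door estimate of P(y | do(x)) using only observed X, M, Y."""
    return sum(
        p_m_given_x_obs(m, x)
        * sum(p_y_given_xm(y, xp, m) * p_x(xp) for xp in (0, 1))
        for m in (0, 1)
    )


def naive(y, x):
    """Plain conditional P(y | x), which is biased by the hidden confounder."""
    return sum(joint[(x, m, y)] for m in (0, 1)) / p_x(x)


def ground_truth(y, x):
    """True P(y | do(x)), computable here only because the toy model exposes U."""
    return sum(
        p_u[u] * sum(p_m_given_x[x][m] * p_y_given_mu(y, m, u) for m in (0, 1))
        for u in (0, 1)
    )


if __name__ == "__main__":
    print("front-door:", front_door(1, 1))
    print("truth:     ", ground_truth(1, 1))
    print("naive:     ", naive(1, 1))
```

In this toy setting the front-door estimate matches the true interventional probability without ever observing U, while the plain conditional P(y | x) does not. This is the sense in which a front-door intervention can suppress spurious correlations.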
