Dynamic Traceback Learning for Medical Report Generation
Shuchang Ye, Mingyuan Meng, Mingjian Li, Dagan Feng, Usman Naseem, Jinman Kim
TL;DR
Dynamic Traceback Learning (DTrace) targets two core problems in automated medical report generation: difficulty capturing subtle pathological details, and degraded zero-shot performance when only images are available at inference. It introduces a traceback mechanism and a dynamic mask-ratio training regime that enforce semantic grounding and bidirectional cross-modal supervision between images and reports. The framework comprises five modules (a visual encoder, a visual decoder, a linguistic encoder, a linguistic decoder, and a cross-modal fusion module) and employs forward and traceback training with adaptively weighted losses. On IU-Xray and MIMIC-CXR, it reports state-of-the-art results on both natural language generation and clinical-efficacy metrics. The approach aims to produce more clinically accurate, linguistically coherent reports from images alone, which could reduce radiologist workload while maintaining diagnostic grounding.
Abstract
Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, which degrades performance in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, so that text generation at inference does not rely strongly on input from both modalities. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from its complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
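The dynamic learning strategy can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: the text mask ratio is drawn uniformly per sample, the image mask ratio is its complement, and masking is applied to plain Python lists standing in for image patches and report tokens. The names `sample_mask_ratios`, `mask_items`, `make_training_pair`, `make_inference_input`, and the `[MASK]` placeholder are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked report token


def sample_mask_ratios(rng):
    """Draw a text mask ratio and pair it with a complementary image ratio.

    Assumption: the two ratios sum to 1, so that whatever is hidden in one
    modality tends to be visible in the other. The abstract only says the
    proportions of image and text input vary during training.
    """
    text_ratio = rng.random()          # uniform in [0, 1)
    image_ratio = 1.0 - text_ratio
    return text_ratio, image_ratio


def mask_items(items, ratio, rng, mask_value):
    """Replace round(ratio * len(items)) randomly chosen positions."""
    n_mask = round(ratio * len(items))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    return [mask_value if i in masked_idx else x for i, x in enumerate(items)]


def make_training_pair(patches, tokens, rng):
    """Build one training input with a freshly sampled pair of mask ratios."""
    text_ratio, image_ratio = sample_mask_ratios(rng)
    masked_patches = mask_items(patches, image_ratio, rng, None)
    masked_tokens = mask_items(tokens, text_ratio, rng, MASK)
    return masked_patches, masked_tokens


def make_inference_input(patches, report_length):
    """Image-only inference: every patch visible, every text position masked."""
    return list(patches), [MASK] * report_length


if __name__ == "__main__":
    rng = random.Random(0)
    patches = list(range(16))  # stand-in for 16 image patch embeddings
    tokens = "no acute cardiopulmonary abnormality is seen".split()
    print(make_training_pair(patches, tokens, rng))
    print(make_inference_input(patches, len(tokens)))
```

Under these assumptions, image-only inference is simply the extreme case of the training distribution in which the text is fully masked, which is one way to read the claim that the model need not rely on both modalities at inference.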
