Dynamic Traceback Learning for Medical Report Generation

Shuchang Ye, Mingyuan Meng, Mingjian Li, Dagan Feng, Usman Naseem, Jinman Kim

TL;DR

Dynamic Traceback Learning (DTrace) tackles two core problems in automated medical report generation: inability to capture subtle pathology and poor zero-shot performance when inference is image-only. It introduces a traceback mechanism and a dynamic mask-ratio training regime that enforce semantic grounding and bidirectional cross-modal supervision between images and reports. The framework comprises five modules (visual encoder/decoder, linguistic encoder/decoder, and cross-modal fusion) and employs forward and traceback training with adaptively weighted losses; evaluation on IU-Xray and MIMIC-CXR shows state-of-the-art results on both natural language generation and clinical-efficacy metrics. The approach promises practical impact by producing more clinically accurate, linguistically coherent reports from images alone, reducing radiologist workload while maintaining robust diagnostic grounding.

Abstract

Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
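The dynamic learning strategy described above can be pictured as drawing a fresh mask ratio for each modality at every training step, so the model sees many different proportions of image and text input. The following is a minimal sketch only, not the authors' implementation: the uniform sampling, the `[MASK]` placeholder, and the helper names `sample_mask_ratio` and `mask_sequence` are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked token or patch


def sample_mask_ratio(rng, low=0.0, high=1.0):
    """Draw a mask ratio for one training step (uniform sampling is an assumption)."""
    return rng.uniform(low, high)


def mask_sequence(items, ratio, rng):
    """Replace round(ratio * len(items)) randomly chosen positions with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = round(ratio * len(items))
    masked_positions = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in masked_positions else x for i, x in enumerate(items)]
    return masked, sorted(masked_positions)


rng = random.Random(0)
report_tokens = "no acute cardiopulmonary abnormality is seen".split()
image_patches = [f"patch_{i}" for i in range(16)]  # stand-ins for image patches

# Each modality gets its own ratio, so the input mix changes from step to step.
text_ratio = sample_mask_ratio(rng)
image_ratio = sample_mask_ratio(rng)
masked_report, report_idx = mask_sequence(report_tokens, text_ratio, rng)
masked_patches, patch_idx = mask_sequence(image_patches, image_ratio, rng)
```

Under this reading, image-only inference resembles the extreme case of an unmasked image paired with a fully masked report, which is one plausible reason training across many ratios could reduce reliance on text input.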

Paper Structure

This paper contains 22 sections, 7 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of different generative frameworks. (a) Common unimodal encoder-decoder framework, (b) Multimodal masked encoder-decoder framework for generative representation learning (GRL), and (c) Our proposed framework with dynamic traceback learning (DTrace).
  • Figure 2: Illustration of the limitation of the existing report generation framework. (a) The model makes predictions based on spurious generation statistics, overlooking the radiology images; (b) Ideally, the model should understand the pathological information in the image and generate a report accordingly.
  • Figure 3: DTrace with dynamic traceback learning. Solid and dashed lines indicate the forward and traceback stages, respectively. Training: (1) Forward stage: masked images and masked reports are encoded by the visual and linguistic encoders, fused by the cross-modal module, and decoded to reconstruct images and reports. (2) Traceback stage: the reconstructed images and reports are re-encoded to verify semantic validity, with losses computed against the unmasked ground truth. Inference: the visual encoder processes the full image, and at each decoding step, the linguistic encoder encodes the autoregressively generated prefix tokens; fused features are passed to the linguistic decoder to predict the next report token until an end token is generated.
  • Figure 4: Dynamic traceback learning with forward stage (left) and traceback stage (right). "Freeze" indicates that the corresponding parameters receive no gradient updates during back-propagation.
  • Figure 5: An example (patient 10014765) of comparisons between different report generation frameworks and the proposed DTrace framework. The information in the ground truth report is labeled from 1 to 6 and highlighted separately. The generated reports are labeled according to the ground truth report and highlighted with different colors to represent the differences between the generated sequences and the ground truth report: (1) Green - consistent; (2) Blue - semantically similar but different in expression; (3) Pink - incorrect information; (4) Gray - missing sentences; (5) Unhighlighted - not included in the ground truth.
  • ...and 4 more figures
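
The inference procedure in the Figure 3 caption (encode the full image once, then repeatedly encode the generated prefix, fuse, and predict the next token until an end token appears) can be sketched as a simple loop. This is an illustrative sketch under stated assumptions, not the paper's code: the function `generate_report`, the `<eos>` token, the `max_len` cap, the greedy one-token-per-step decoding, and all the toy stand-in modules are hypothetical.

```python
END = "<eos>"  # hypothetical end-of-report token


def generate_report(image, visual_encoder, linguistic_encoder, fuse, decode_next,
                    max_len=64):
    """Autoregressive image-only decoding, following the Figure 3 description.

    All four module arguments are hypothetical callables standing in for the
    visual encoder, linguistic encoder, cross-modal fusion, and linguistic decoder.
    """
    visual_feats = visual_encoder(image)  # the full image is encoded once
    prefix = []  # tokens generated so far
    for _ in range(max_len):  # max_len is an assumed safety cap
        text_feats = linguistic_encoder(prefix)  # encode the generated prefix
        fused = fuse(visual_feats, text_feats)   # cross-modal fusion
        token = decode_next(fused)               # predict the next report token
        if token == END:
            break
        prefix.append(token)
    return prefix


# Toy stand-ins so the loop can run; they replay a canned sentence and
# carry no learned behavior.
CANNED = ["heart", "size", "is", "normal", END]


def toy_visual_encoder(image):
    return {"n_pixels": len(image)}


def toy_linguistic_encoder(prefix):
    return {"length": len(prefix)}


def toy_fuse(visual_feats, text_feats):
    return (visual_feats, text_feats)


def toy_decode_next(fused):
    position = min(fused[1]["length"], len(CANNED) - 1)
    return CANNED[position]


report = generate_report([0, 0, 0, 0], toy_visual_encoder, toy_linguistic_encoder,
                         toy_fuse, toy_decode_next)
print(" ".join(report))  # prints "heart size is normal"
```

The key point the sketch tries to capture is that no ground-truth report text enters the loop: the only linguistic input is the prefix the model has already produced.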