Textual Inversion and Self-supervised Refinement for Radiology Report Generation

Yuanjiang Luo, Hongxiang Li, Xuan Wu, Meng Cao, Xiaoshuang Huang, Zhihong Zhu, Peixi Liao, Hu Chen, Yi Zhang

TL;DR

This paper tackles faithful radiology report generation by addressing two issues: grounding the report in the visual content and bridging the modality gap between images and text. It introduces Textual Inversion and Self-supervised Refinement (TISR), which maps image features into a shared textual space as pseudo words and refines them with a contrastive objective that aligns textual and visual representations. The approach is plug-and-play and yields consistent improvements across multiple baselines on IU X-ray and MIMIC-CXR, on both standard NLG metrics and clinically oriented CheXbert evaluations. By enabling more faithful, image-consistent reports without extra labeling, TISR has the potential to improve clinical efficiency and diagnostic reliability in radiology workflows.

Abstract

Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring report content constraints. In this paper, we propose Textual Inversion and Self-supervised Refinement (TISR) to address these two issues. Specifically, textual inversion projects text and images into the same space by representing images as pseudo words, which eliminates the cross-modal gap. Subsequently, self-supervised refinement refines these pseudo words through a contrastive loss computed between images and texts, enhancing the fidelity of generated reports to images. Notably, TISR is plug-and-play and orthogonal to most existing methods. We conduct experiments on two widely used public datasets and achieve significant improvements on various baselines, which demonstrates the effectiveness and generalizability of TISR. The code will be available soon.
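
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how images are mapped to pseudo words or which contrastive loss is used. The sketch assumes a single linear projection for textual inversion and a symmetric InfoNCE-style loss (as popularized by CLIP) for the refinement step. All function names, the `temperature` value, and the toy vectors are hypothetical.

```python
import math


def project_to_pseudo_words(image_feature, weights):
    """Map an image feature vector into the text embedding space.

    Assumption: a single linear map stands in for textual inversion;
    the paper's actual mapping network is not described in the abstract.
    """
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(pseudo_words, text_embeddings, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched pairs.

    pseudo_words[i] and text_embeddings[i] come from the same study;
    every other pairing in the batch is treated as a negative.
    """
    p = [l2_normalize(v) for v in pseudo_words]
    t = [l2_normalize(v) for v in text_embeddings]
    # Cosine similarity of every pseudo word with every report embedding.
    logits = [
        [sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy usage: two image features, an identity projection, two report embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
reports = [[0.9, 0.1], [0.2, 0.8]]
pseudo = [project_to_pseudo_words(x, identity) for x in images]
loss = contrastive_loss(pseudo, reports)
```

Under these assumptions, the loss is small when each pseudo word is closest to its own report embedding and large when the pairs are mismatched, which is the sense in which such an objective pulls the image-derived pseudo words toward the matching text.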

Paper Structure (9 sections, 8 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Existing challenges in radiology report generation.
  • Figure 2: Overview of our method. The dashed arrow indicates that, until the entire report has been produced, each word is generated from the image features and the text embeddings obtained by encoding the previously generated words.
  • Figure 3: Visualization. Red: areas the network attends to strongly; blue: areas it pays little attention to; black line: correct description; red line: incorrect description.