PairAug: What Can Augmented Image-Text Pairs Do for Radiology?

Yutong Xie, Qi Chen, Sinuo Wang, Minh-Son To, Iris Lee, Ee Win Khoo, Kerolos Hendy, Daniel Koh, Yong Xia, Qi Wu

TL;DR

This work addresses the scarcity of paired radiology image-text data by introducing PairAug, a dual-branch augmentation framework that expands both modalities at once. InterAug creates novel image-report pairs by generating reports with a large language model and synthesizing images from those reports, while IntraAug edits existing pairs by swapping diffusion cross-attention maps to produce diverse conditions for the same patient; both branches are followed by semantics-aware pruning. By combining real MIMIC-CXR data with PairAug-generated samples, the authors train a CheXzero-style vision-language model and demonstrate improved zero-shot and fine-tuning performance on multiple chest radiograph benchmarks, outperforming single-modality augmentation baselines and other medical VLP methods. The approach is a promising route to scalable, privacy-preserving medical VLP, although its benefit depends on generation quality and careful data pruning.

Abstract

Current vision-language pre-training (VLP) methodologies predominantly depend on paired image-text datasets, a resource that is challenging to acquire in radiology due to privacy considerations and labelling complexities. Data augmentation provides a practical solution to the issue of data scarcity; however, most augmentation methods have a limited focus, prioritising either image or text augmentation exclusively. Acknowledging this limitation, our objective is to devise a framework capable of concurrently augmenting medical image and text data. We design a Pairwise Augmentation (PairAug) approach that contains an Inter-patient Augmentation (InterAug) branch and an Intra-patient Augmentation (IntraAug) branch. Specifically, the InterAug branch generates radiology images from synthesised yet plausible reports derived from a Large Language Model (LLM). The generated pairs can be considered a collection of new patient cases, since they are artificially created and may not exist in the original dataset. In contrast, the IntraAug branch uses newly generated reports to manipulate images. This process allows us to create new paired data for each individual with diverse medical conditions. Our extensive experiments on various downstream tasks, covering zero-shot and fine-tuned medical image classification, demonstrate that PairAug, by concurrently expanding both image and text data, substantially outperforms image-/text-only expansion baselines and advanced medical VLP baselines. Our code is released at https://github.com/YtongXie/PairAug.

Paper Structure (20 sections, 8 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Overview of the Pairwise Augmentation (PairAug) pipeline, which consists of two branches: Inter-patient Augmentation (InterAug) and Intra-patient Augmentation (IntraAug). In InterAug, we first generate new reports $\hat{y}\in\Omega_a$ with a large language model $\mathcal{P}$ from the original reports $y\in\Omega$. Then, we synthesise images $\hat{x}\in\Omega_a$ from the generated reports and prune the generated image-report pairs according to their semantic alignment. In IntraAug, we seek to generate images for the same individual but with different medical conditions. To this end, we reuse the same generation model $\mathcal{G}$ to synthesise images but swap the cross-attention map $M$ from the original report $y$ with that (i.e., $M'$) from the modified report $y'$ during the generation process. After that, we apply a data pruning method based on both synthetic pairs $(x',y')\in\Omega_e$ and original pairs $(x,y)\in\Omega$. Finally, we merge $\Omega_{\hat{a}}$ and $\Omega_{e'}$ as the final synthetic paired data set $\Omega_{\tilde{s}}$.
  • Figure 2: Images synthesised from original and generated reports by (a) the T2I model in InterAug and (b) the T2I model in IntraAug with attention map swapping, respectively. * denotes images generated from the original reports rather than the original images.
  • Figure 3: t-SNE visualisation of image/report embeddings, comparing synthesised data from the IntraAug and InterAug methods against real data from the MIMIC-CXR dataset.
  • Figure 4: Radiology report before and after editing by ChatGPT and the corresponding images generated by our InterAug. We highlight the specific areas in the radiology image with red bounding boxes and the corresponding descriptions in reports with the same colour.
  • Figure 5: Radiology image-report pair synthesised via IntraAug.
  • ...and 3 more figures
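
The Figure 1 caption mentions pruning generated image-report pairs by their semantic alignment. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each generated image and report has already been encoded into an embedding by some vision-language encoder, scores each pair by cosine similarity, and keeps pairs above a cut-off. The function names, the toy embeddings, and the threshold of 0.5 are all hypothetical.

```python
import numpy as np


def cosine_alignment(image_emb, text_emb):
    """Row-wise cosine similarity between paired image and report embeddings."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    dot = np.sum(image_emb * text_emb, axis=1)
    norms = np.linalg.norm(image_emb, axis=1) * np.linalg.norm(text_emb, axis=1)
    # Guard against division by zero for all-zero embeddings.
    return dot / np.maximum(norms, 1e-12)


def prune_pairs(image_emb, text_emb, threshold=0.5):
    """Return indices of pairs whose alignment score is at least `threshold`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    scores = cosine_alignment(image_emb, text_emb)
    return np.flatnonzero(scores >= threshold)


# Toy embeddings (hypothetical): pair 0 is well aligned, pair 1 is
# orthogonal, and pair 2 is partially aligned.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

kept = prune_pairs(image_emb, text_emb, threshold=0.5)
print(kept.tolist())
```

In this toy case the orthogonal pair is dropped and the other two are kept. A real pipeline would obtain the embeddings from a pretrained medical vision-language encoder and would need a cut-off tuned on held-out data.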