Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation

Ruochen Pi, Lianlei Shan

TL;DR

This paper tackles the data annotation bottleneck in medical lung X-ray segmentation by introducing DiffMask, a diffusion-based pipeline that uses cross-attention between text prompts and images to generate synthetic images together with their semantic masks. It couples adaptive thresholding with DenseCRF and AffinityNet refinements to produce high-quality masks, and it bridges synthetic and real data through data augmentation and retrieval-based prompts. Experiments show that segmentation models trained on synthetic data achieve IoU scores comparable to or better than those trained on real data across architectures such as UNet and TransUNet, with notable gains when real data is scarce. The approach demonstrates zero-shot capabilities for unseen classes and offers a scalable path to reduce annotation costs, potentially transforming medical image analysis workflows.
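The mask-generation step described above can be illustrated with a minimal sketch. It assumes a single 2D cross-attention map for one prompt token is already available as a NumPy array, and it uses Otsu's method as a stand-in for the adaptive thresholding, since the exact rule is not given in this summary. The function names (`normalize_map`, `otsu_threshold`, `attention_to_mask`) are hypothetical, and the DenseCRF and AffinityNet refinements are not shown.

```python
import numpy as np


def normalize_map(attn: np.ndarray) -> np.ndarray:
    """Rescale an attention map to [0, 1]; a constant map becomes all zeros."""
    attn = attn.astype(np.float64)
    span = attn.max() - attn.min()
    if span == 0:
        return np.zeros_like(attn)
    return (attn - attn.min()) / span


def otsu_threshold(values: np.ndarray, bins: int = 256) -> float:
    """Return the bin center that maximizes between-class variance (Otsu)."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    weight_bg = np.cumsum(hist)            # pixel count at or below each bin
    weight_fg = hist.sum() - weight_bg     # pixel count above each bin
    cum_sum = np.cumsum(hist * centers)
    mean_bg = cum_sum / np.maximum(weight_bg, 1)
    mean_fg = (cum_sum[-1] - cum_sum) / np.maximum(weight_fg, 1)
    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    return float(centers[np.argmax(between)])


def attention_to_mask(attn: np.ndarray) -> np.ndarray:
    """Binarize one cross-attention map with a per-image threshold.

    Assumption: Otsu's method stands in for the paper's adaptive
    thresholding, whose exact rule is not stated in this summary.
    """
    norm = normalize_map(attn)
    return norm > otsu_threshold(norm.ravel())


# Toy map: weak background response with a strong 4x4 response in the center.
toy_attn = np.full((8, 8), 0.1)
toy_attn[2:6, 2:6] = 0.9
toy_mask = attention_to_mask(toy_attn)
```

Because the threshold is derived from each map's own histogram, no fixed cutoff has to be tuned per class or per image. In a fuller pipeline, a coarse mask like `toy_mask` would be the input to the refinement stages.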

Abstract

Collecting and annotating medical images is a time-consuming and resource-intensive task. Generating synthetic data with diffusion models offers a cost-effective alternative. This paper introduces a new method for automatically generating accurate semantic masks for synthetic lung X-ray images, based on a Stable Diffusion model trained on text-image pairs. The method uses the cross-attention maps between text and image to extend text-driven image synthesis to semantic mask generation. It uses text-guided cross-attention information to localize specific areas in an image and combines this with additional refinement techniques to produce high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. The experimental results demonstrate that segmentation models trained on synthetic data generated with the method are comparable to, and in some cases better than, models trained on real datasets. This shows the effectiveness of the method and its potential to revolutionize medical image analysis.
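The comparison between models trained on synthetic and real data is reported in terms of IoU (intersection over union), according to the TL;DR. As a minimal sketch of that metric for a single pair of binary masks, the hypothetical helper `iou` below assumes NumPy arrays of equal shape and treats two empty masks as a perfect match. That convention is an assumption here, not something stated in the paper.

```python
import numpy as np


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; scoring this as 1.0 is an assumed convention.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


# Toy example: two 8-pixel masks that overlap on one row of 4 pixels.
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[0:2, :] = True
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, :] = True
score = iou(pred_mask, true_mask)  # intersection 4, union 12
```

For a multi-class setting, the same function could be applied per class and averaged. Whether the paper averages per class or per image is not specified in this section.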

Paper Structure

This paper contains 23 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the Method: The workflow of our method, highlighting the integration of cross-attention mechanisms in generative models to bridge synthetic and real data for seamless mask generation and segmentation training.
  • Figure 2: Method 1: Cross-fusion of DenseCRF and Affinity modules synergistically refines mask quality by combining spatial coherence with semantic optimization.
  • Figure 3: Method 2: Sequential application of Affinity for semantic drafting followed by DenseCRF for spatial and visual refinement achieves a precise and polished final mask.
  • Figure 4: Segmentation Networks Training: The diagram shows a process where features from generated data and refinements from real data are integrated to enhance model output.