Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation
Ruochen Pi, Lianlei Shan
TL;DR
This paper tackles the annotation bottleneck in lung X-ray segmentation by introducing DiffMask, a diffusion-based pipeline that uses the cross-attention between text prompts and images to generate synthetic images together with their semantic masks. It refines the masks with adaptive thresholding, DenseCRF, and AffinityNet, and narrows the gap between synthetic and real data through data augmentation and retrieval-based prompts. Experiments show that segmentation models trained on synthetic data achieve IoU scores comparable to or better than those trained on real data across architectures such as UNet and TransUNet, with notable gains when real data is scarce. The approach also demonstrates zero-shot capability for unseen classes and offers a scalable way to reduce annotation costs, which could change medical image analysis workflows.
Abstract
Collecting and annotating medical images is time-consuming and resource-intensive. Generating synthetic data with diffusion models offers a cost-effective alternative. This paper introduces a method for automatically generating accurate semantic masks for synthetic lung X-ray images, based on a Stable Diffusion model trained on text-image pairs. The method uses the cross-attention maps between text and image to extend text-driven image synthesis to semantic mask generation: text-guided cross-attention localizes specific regions in an image, and further refinement techniques turn these coarse localizations into high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. Experimental results show that segmentation models trained on synthetic data generated by the method perform comparably to, and in some cases better than, models trained on real datasets. This demonstrates the effectiveness of the method and its potential for medical image analysis.
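As a rough illustration of how cross-attention maps for a class token might be turned into a coarse binary mask, the sketch below averages maps from several resolutions, normalizes the result, and thresholds it. All function names here are hypothetical, and the nearest-neighbour upsampling and mean-based threshold are assumptions made for illustration; they are not the paper's actual adaptive thresholding, and the DenseCRF and AffinityNet refinement stages are omitted.

```python
from typing import List

Map = List[List[float]]


def upsample_nearest(m: Map, size: int) -> Map:
    """Nearest-neighbour upsampling of a square map to size x size."""
    n = len(m)
    return [[m[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def fuse_attention(maps: List[Map], size: int) -> Map:
    """Average multi-resolution maps, then min-max normalize to [0, 1]."""
    ups = [upsample_nearest(m, size) for m in maps]
    avg = [[sum(u[r][c] for u in ups) / len(ups) for c in range(size)]
           for r in range(size)]
    flat = [v for row in avg for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span for v in row] for row in avg]


def mean_threshold(att: Map) -> float:
    """Illustrative per-image threshold: the mean attention value (assumption)."""
    flat = [v for row in att for v in row]
    return sum(flat) / len(flat)


def to_mask(att: Map, threshold: float) -> List[List[int]]:
    """Mark pixels whose attention is strictly above the threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in att]


# Toy input: one 2x2 map whose right half attends to the class token.
toy_maps = [[[0.0, 1.0], [0.0, 1.0]]]
fused = fuse_attention(toy_maps, size=4)
mask = to_mask(fused, mean_threshold(fused))
```

In a real pipeline the input maps would come from the cross-attention layers of the diffusion model, and the coarse mask would then be passed to a refinement stage.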
