CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

Rupak Bose, Chinedu Innocent Nwoye, Aditya Bhat, Nicolas Padoy

TL;DR

CoSimGen tackles the challenge of generating high-quality paired images and segmentation masks under flexible conditioning. It introduces Spectron for spatio-spectral embedding fusion and Textron for text-grounded conditioning within a diffusion framework, augmented by a contrastive triplet loss and an adversarial regularizer, followed by a super-resolution module that upscales outputs to 512×512. Across four diverse datasets, it reports state-of-the-art results, including the lowest KID (0.11) and LPIPS (0.53), while ensuring semantic alignment between images and masks, which supports data augmentation and domain adaptation. The work highlights the practical impact of controllable, multimodal data generation for medical, geospatial, and general computer vision tasks, and discusses stability considerations and data requirements for diffusion-based approaches.

Abstract

The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates a contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs (128 × 128) are super-resolved to 512 × 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on four diverse datasets using FID, KID, LPIPS, class FID, and positive predictive value, which measure image fidelity and the semantic alignment of generated samples. CoSimGen achieves state-of-the-art performance on all datasets, with the lowest KID of 0.11 and LPIPS of 0.53 across datasets.
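
To make the contrastive triplet loss between text and class embeddings more concrete, here is a minimal sketch of a generic margin-based triplet loss. It is not taken from the paper: the choice of the text embedding as anchor, the squared Euclidean distance on L2-normalized vectors, the margin of 0.2, and all function and variable names are assumptions made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`. The margin value is an assumed
    hyperparameter, not one reported in the paper.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(squared_distance(a, p) - squared_distance(a, n) + margin, 0.0)


# Hypothetical toy embeddings: a text prompt embedding (anchor), the
# embedding of its matching class vector (positive), and the embedding
# of an unrelated class vector (negative).
text_emb = [1.0, 0.0]
matching_class_emb = [0.9, 0.1]
other_class_emb = [0.0, 1.0]

loss_aligned = triplet_loss(text_emb, matching_class_emb, other_class_emb)
loss_swapped = triplet_loss(text_emb, other_class_emb, matching_class_emb)
```

In this toy case the aligned triplet already satisfies the margin, so its loss is zero, while swapping the positive and negative yields a large penalty. Minimizing such a loss pulls a text embedding toward its class embedding, which is the kind of shared embedding space that would let text and class prompts be interchanged at inference.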

Paper Structure

This paper contains 27 sections, 9 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: CoSimGen takes either text or class vector as input prompt and generates a high-resolution image minimally representing the prompt context and a mask segmenting all the objects in the prompt.
  • Figure 2: Architecture of CoSimGen, showing the input conditioning, the diffusion process, and super-resolution of the generated outputs.
  • Figure 3: Spectron: spatio-spectral embedding fusion for semantic context and temporal feature conditioning.
  • Figure 4: Textron: text-grounding of semantic class via contrastive triplet loss for text/class conditioning, enabling hot-swapping of text and class conditional input prompts during inference.
  • Figure 5: Qualitative comparisons of generated image-mask pairs.
  • ...and 9 more figures