Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation

Buddhi Wijenayake, Nichula Wasalathilake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake, Vishal M. Patel

TL;DR

The paper addresses long-tail pixel imbalance and domain shift between Urban and Rural LoveDA splits in high-resolution remote-sensing semantic segmentation. It introduces a two-stage prompt-controlled diffusion pipeline: Stage A uses a domain- and ratio-conditioned discrete layout diffusion (D3PM) to generate label maps with targeted class proportions, and Stage B employs a layout-guided latent diffusion with ControlNet to render photorealistic, domain-consistent images from those layouts. A greedy enrichment strategy yields roughly 2000 synthetic label–image pairs, which are mixed with real LoveDA data to train multiple segmentation backbones, with notable gains for minority classes and improved cross-domain generalization. The work demonstrates that controllable generative augmentation is a practical approach to mitigating long-tail bias in remote-sensing segmentation and can complement existing augmentation and loss-based methods, enabling more robust urban/rural land-cover mapping.
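The summary does not spell out how the greedy enrichment step chooses samples, so the following is only a minimal sketch of one plausible reading: starting from the real per-class pixel counts, repeatedly add the candidate synthetic layout that most reduces class imbalance. The imbalance measure (KL divergence from a uniform class distribution) and the names `imbalance` and `greedy_enrich` are assumptions made for illustration, not the authors' procedure.

```python
import math

def imbalance(counts):
    """KL divergence of the class distribution from uniform (0 = perfectly balanced).

    Assumption: this particular imbalance score is illustrative only.
    """
    total = sum(counts)
    k = len(counts)
    return sum((c / total) * math.log((c / total) * k) for c in counts if c > 0)

def greedy_enrich(real_counts, candidates, budget):
    """Select up to `budget` candidate layouts, one at a time.

    real_counts: per-class pixel counts of the real training set.
    candidates:  list of per-class pixel counts, one entry per synthetic layout.
    Returns the chosen candidate indices and the resulting per-class counts.
    """
    counts = list(real_counts)
    remaining = dict(enumerate(candidates))
    chosen = []
    while remaining and len(chosen) < budget:
        best_idx, best_score = None, imbalance(counts)
        for idx, cand in remaining.items():
            score = imbalance([a + b for a, b in zip(counts, cand)])
            if score < best_score:  # keep only candidates that strictly help
                best_idx, best_score = idx, score
        if best_idx is None:  # no remaining candidate improves the balance
            break
        counts = [a + b for a, b in zip(counts, remaining.pop(best_idx))]
        chosen.append(best_idx)
    return chosen, counts

# Toy example with three classes: one dominant, two rare.
real = [900, 80, 20]
pool = [[100, 0, 0], [0, 10, 10], [0, 100, 100]]
chosen, final_counts = greedy_enrich(real, pool, budget=3)
```

In this toy run the layout made up entirely of the dominant class is never selected, because adding it would only increase the imbalance score; the loop stops early rather than spend the rest of the budget on it.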

Abstract

Semantic segmentation of high-resolution remote-sensing imagery is critical for urban mapping and land-cover monitoring, yet training data typically exhibits severe long-tailed pixel imbalance. In the LoveDA dataset, this challenge is compounded by an explicit Urban/Rural split with distinct appearance and inconsistent class-frequency statistics across domains. We present a prompt-controlled diffusion augmentation framework that synthesizes paired label–image samples with explicit control of both domain and semantic composition. Stage A uses a domain-aware, masked ratio-conditioned discrete diffusion model to generate layouts that satisfy user-specified class-ratio targets while respecting learned co-occurrence structure. Stage B translates layouts into photorealistic, domain-consistent images using Stable Diffusion with ControlNet guidance. Mixing the resulting ratio- and domain-controlled synthetic pairs with real data yields consistent improvements across multiple segmentation backbones, with gains concentrated on minority classes and improved Urban and Rural generalization, demonstrating controllable augmentation as a practical mechanism to mitigate long-tail bias in remote-sensing segmentation. Source code, pretrained models, and synthetic datasets are available on GitHub: https://github.com/Buddhi19/SyntheticGen.git
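To make "layouts that satisfy user-specified class-ratio targets" concrete, the sketch below measures the class proportions of a label map and compares them with a target on the constrained classes only. It assumes that "masked" means a binary mask marking which classes the user actually constrained; the class index order, the error metric, and the function names are hypothetical and not taken from the paper.

```python
from collections import Counter

# LoveDA's seven land-cover classes; the index order here is an assumption.
CLASSES = ["background", "building", "road", "water",
           "barren", "forest", "agriculture"]

def class_ratios(label_map):
    """Fraction of pixels per class in a 2-D label map (rows of class indices)."""
    counts = Counter(px for row in label_map for px in row)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(len(CLASSES))]

def masked_ratio_error(label_map, target, mask):
    """Mean absolute gap between achieved and target ratios.

    Only classes with mask == 1 (the constrained ones) contribute;
    unconstrained classes are free to take any proportion.
    """
    achieved = class_ratios(label_map)
    gaps = [abs(a - t) for a, t, m in zip(achieved, target, mask) if m]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Toy 2x4 layout: 4 building pixels, 2 water pixels, 2 background pixels.
layout = [[1, 1, 3, 3],
          [1, 1, 0, 0]]
target = [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]  # want 50% building, 50% water
mask = [0, 1, 0, 1, 0, 0, 0]                  # only those two are constrained
error = masked_ratio_error(layout, target, mask)
```

Here the building target is met exactly and water falls short by 0.25, so the mean gap over the two constrained classes is 0.125. A check of this kind could be used to filter or score sampled layouts, although the summary does not say whether the authors do so.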

Paper Structure (16 sections, 10 equations, 4 figures, 1 table)

This paper contains 16 sections, 10 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Dataset balancing and prompt-controllable synthesis on LoveDA. (a) Pixel-frequency distributions for Rural, Urban, and the combined training set, comparing the original data (solid) against our augmented dataset (hatched). (b–f) Representative synthesized image–label pairs generated under explicit domain (Urban/Rural) and class-ratio constraints, illustrating controllable diffusion for both domain-consistent appearance and targeted semantic proportions.
  • Figure 2: Stage A: domain- and ratio-conditioned discrete diffusion (D3PM) for semantic layout generation. A U-Net denoiser predicts categorical logits from a noisy label map, conditioned on a masked class-ratio target and an Urban/Rural domain embedding.
  • Figure 3: Stage B: layout-guided latent diffusion for image synthesis. A Stable Diffusion U-Net is guided by ControlNet features from the layout, with FiLM-gated residual injection and a domain/ratio prompt for domain-consistent appearance.
  • Figure 4: Prompt-controlled inference pipeline. The prompt is parsed into domain and ratio targets, a layout is sampled with Stage A, and a photorealistic image is generated with Stage B using the sampled layout as spatial guidance.
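
The first step of the inference pipeline in Figure 4, parsing a prompt into domain and ratio targets, might look roughly like the sketch below. The prompt grammar (`"<domain> ..., <class> <percent>%"`), the class index order, and the `parse_prompt` helper are all hypothetical; the summary does not describe the actual prompt format, so this is only an illustration of the idea.

```python
import re

# LoveDA's seven land-cover classes; the index order here is an assumption.
CLASSES = ["background", "building", "road", "water",
           "barren", "forest", "agriculture"]
DOMAINS = {"urban": 0, "rural": 1}  # hypothetical domain ids

def parse_prompt(prompt):
    """Turn a prompt like 'Urban scene, building 30%, water 10%' into
    (domain_id, ratio_target, mask) for conditioning a layout model.

    Classes not mentioned in the prompt stay unconstrained (mask == 0).
    """
    text = prompt.lower()
    domain = next((d for d in DOMAINS if re.search(rf"\b{d}\b", text)), None)
    if domain is None:
        raise ValueError("prompt must name a domain: urban or rural")

    ratios = [0.0] * len(CLASSES)
    mask = [0] * len(CLASSES)
    # Match pairs such as 'building 30%' or 'water 12.5 %'.
    for name, pct in re.findall(r"([a-z]+)\s+(\d+(?:\.\d+)?)\s*%", text):
        if name in CLASSES:
            idx = CLASSES.index(name)
            ratios[idx] = float(pct) / 100.0
            mask[idx] = 1

    if sum(ratios) > 1.0 + 1e-9:
        raise ValueError("requested class ratios exceed 100%")
    return DOMAINS[domain], ratios, mask

domain_id, ratio_target, ratio_mask = parse_prompt(
    "Urban scene, building 30%, water 10%"
)
```

The returned triple corresponds to the conditioning inputs named in Figure 2: a domain label for the Urban/Rural embedding and a masked class-ratio target. In the pipeline described above, Stage A would sample a layout from these inputs and Stage B would render an image from that layout.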