GenMask: Adapting DiT for Segmentation via Direct Mask Generation

Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang

Abstract

Recent approaches to segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of relying on indirect adaptation, segmentation tasks should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise-robust, and linearly separable, unlike natural image latents. To bridge this gap, we introduce a timestep sampling strategy that emphasizes extreme noise levels for segmentation and moderate noise levels for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.

Paper Structure (15 sections, 8 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overall architecture of our model. Brown and green denote the segmentation and generation data flows, respectively. The generation task follows the standard diffusion training process, and its timestep sampling strategy is similar to that of Stable Diffusion 3 [Esser2024SD3], emphasizing intermediate denoising steps. For the segmentation task, we use an extremely long-tailed sampling strategy. We also feed the VAE representation of the input image into the DiT to supplement low-level information for segmentation.
  • Figure 2: The process of adding noise to a natural image and to a binary mask. Compared with the natural image, the binary mask is much more robust to noise.
  • Figure 3: VAE features for binary segmentation masks are linearly separable. (a) The input segmentation masks. (b) PCA labels of their VAE representations. (c) The difference between the input masks and the PCA labels after histogram normalization.
  • Figure 4: SVM validation accuracy on VAE embeddings of binary masks under different noise levels. The embeddings remain linearly separable under low noise, while only high-intensity perturbations substantially degrade separability.
  • Figure 5: (Top) Importance resampling function for the segmentation task with different values of the hyperparameter $a$. A more extreme value of $a$ concentrates the distribution more heavily in the high-noise regime. (Bottom) The sampling strategies used for the segmentation and generation tasks. The generation task uses a relatively uniform sampling strategy that only emphasizes the intermediate denoising steps, while the segmentation task uses an extremely long-tailed sampling strategy whose peak value is 8× higher.
  • ...and 2 more figures
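
The task-dependent timestep sampling described in the abstract and in the captions of Figures 1 and 5 can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: timesteps lie in [0, 1] with t = 1 meaning pure noise; the generation branch uses a logit-normal sampler (the kind of schedule used by Stable Diffusion 3); and the segmentation branch uses a hypothetical power-law density p(t) = a·t^(a-1), chosen only because it is long-tailed and concentrated at high noise. The paper's actual importance resampling function is not given in this excerpt, and all function names below are illustrative.

```python
import math
import random


def sample_generation_timestep(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Logit-normal timestep in (0, 1); most mass sits at intermediate noise levels."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def sample_segmentation_timestep(rng: random.Random, a: float = 8.0) -> float:
    """Hypothetical long-tailed sampler (an assumption, not the paper's function).

    Inverse-CDF sampling from the density p(t) = a * t**(a - 1) on [0, 1],
    which concentrates near t = 1 (high noise) as `a` grows.
    """
    u = rng.random()
    return u ** (1.0 / a)


def sample_timestep(task: str, rng: random.Random, a: float = 8.0) -> float:
    """Pick the sampler by task so both tasks can share one training loop."""
    if task == "segmentation":
        return sample_segmentation_timestep(rng, a)
    if task == "generation":
        return sample_generation_timestep(rng)
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    seg = [sample_timestep("segmentation", rng) for _ in range(10000)]
    gen = [sample_timestep("generation", rng) for _ in range(10000)]
    print("mean segmentation timestep:", sum(seg) / len(seg))
    print("mean generation timestep:", sum(gen) / len(gen))
```

In a joint training loop, each sample in a batch would draw its timestep from the sampler matching its task before noise is added to its latent. Under this sketch, segmentation samples are mostly trained at high noise, where the mask latents stop being trivially separable, while generation samples keep a schedule centered on intermediate noise.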