Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong

Abstract

Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
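
The quasi-random category encoding mentioned in the abstract can be illustrated with a generic Kronecker sequence, where the n-th point is the fractional part of n times a vector of irrational step sizes. The sketch below is a minimal illustration of that general idea, not the paper's actual scheme: the function name `kronecker_codes`, the 3-dimensional code space, and the choice of square roots of primes as step sizes are all assumptions made here for illustration.

```python
import numpy as np

def kronecker_codes(num_classes: int, dim: int = 3) -> np.ndarray:
    """Return a (num_classes, dim) array of quasi-random points in [0, 1)^dim.

    Hypothetical helper: each class index n is mapped to frac(n * alpha),
    the defining rule of a Kronecker sequence.
    """
    primes = np.array([2, 3, 5, 7, 11, 13, 17, 19], dtype=np.float64)
    if dim > len(primes):
        raise ValueError("this sketch only supports dim <= 8")
    # Assumption: irrational step sizes taken as frac(sqrt(prime)).
    alpha = np.sqrt(primes[:dim]) % 1.0
    # Class indices 1..num_classes as a column, broadcast against alpha.
    n = np.arange(1, num_classes + 1)[:, None]
    return (n * alpha[None, :]) % 1.0

# Example: 150 codes, matching the class count of the standard ADE20K benchmark.
codes = kronecker_codes(150, dim=3)
```

Because the codes are computed in closed form from the class index, no learned embedding table or iterative point-spreading optimization is needed, which is one plausible reading of "computationally efficient" in the abstract.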

Paper Structure (16 sections, 19 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Visualization of gradient vanishing and trajectory traversing in generative segmentation (x-prediction). (a) Vanilla flow matching suffers from vanishing gradients near semantic centroids $\mu$, resulting in slow and non-discriminative trajectories that may traverse proximal neighborhoods of competing centroids and cause false predictions. (b) FlowSeg (ours) introduces a potential function $\Phi$ to enhance gradients around target centroids and enforce repulsion from non-targets, enabling faster convergence and more discriminative, deflected trajectories. The $x - \mu$ term maintains convergence near outer boundaries. (c) Gradient norm from centroid to decision boundary (gray regions in (a),(b)): Our method maintains strong gradients, whereas vanilla flow matching gradients nearly vanish. Yellow curves and blue dashed lines denote the predicted trajectory and decision boundary; the green dot marks the target centroid, and blue crosses indicate irrelevant categories.
  • Figure 2: Comparison of diffusion segmentation paradigms. (a) Diffusion models are used primarily for mask refinement, relying on an external backbone for feature extraction or auxiliary networks for coarse segmentation. (b) Diffusion models serve as the backbone, followed by a dedicated segmentation head. (c) The segmentation task is formulated as image-to-mask translation without auxiliary networks, yet still depends on a pretrained VAE. (d) FlowSeg (ours) performs pixel-level end-to-end training without additional auxiliary modules, and rectifies vanilla flow matching by reshaping the underlying vector field for better optimization. Noise is omitted for simplicity.
  • Figure 3: Visualization of segmentation results on (a) ADE20K and (b) COCO-Stuff. White in the ground truth (GT) marks ignored regions. Because ADE20K and COCO-Stuff have different numbers of categories, the same color in (a) and (b) does not necessarily represent the same semantic category.
  • Figure 4: Visual comparisons between FlowSeg (ours) and SymmFlow (Baseline). The diffusion model first predicts pseudo-masks (Raw), then maps them to the nearest semantic centroids to obtain the final masks (Map). (a) Comparison between deterministic (ours) and stochastic modeling: SymmFlow’s predictions vary with random seeds, while ours remain consistent. (b) VAE-based latent space modeling produces masks with similar colors that may not correspond to the correct semantic categories, due to imperfect alignment with pixel-level centroids.
  • Figure 5: Convergence comparison of different training recipes. (a) FlowSeg vs. vanilla flow matching. (b) Training with REPA vs. without REPA. (c) Different transformation operators $\mathcal{T}$.
  • ...and 1 more figure
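
The Raw-to-Map step described in the Figure 4 caption, where predicted pseudo-masks are snapped to the nearest semantic centroid, can be sketched as a per-pixel nearest-neighbor lookup. This is a minimal sketch under stated assumptions: the function name `map_to_nearest_centroid`, the use of squared Euclidean distance, and the toy centroids are illustrative choices, not details confirmed by the source.

```python
import numpy as np

def map_to_nearest_centroid(raw: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each pixel of a raw pseudo-mask to its closest centroid.

    raw:       (H, W, D) continuous per-pixel predictions.
    centroids: (K, D) one code per semantic category.
    Returns an (H, W) integer array of category indices.
    """
    # Pairwise differences via broadcasting: (H, W, K, D).
    diff = raw[:, :, None, :] - centroids[None, None, :, :]
    # Assumption: squared Euclidean distance decides the nearest centroid.
    dist2 = (diff ** 2).sum(axis=-1)
    return dist2.argmin(axis=-1)

# Toy example with three hypothetical centroids in a 3-D code space.
centroids = np.array([[0.1, 0.1, 0.1],
                      [0.9, 0.1, 0.1],
                      [0.1, 0.9, 0.5]])
labels_true = np.array([[0, 1],
                        [2, 1]])
rng = np.random.default_rng(0)
# Simulate a raw prediction: the true centroid plus small bounded noise.
raw = centroids[labels_true] + rng.uniform(-0.05, 0.05, size=(2, 2, 3))
labels = map_to_nearest_centroid(raw, centroids)
```

With well-separated centroids, small perturbations in the raw prediction do not change the mapped label; when centroids sit close together, the same perturbation can flip the assignment, which is the kind of failure the caption attributes to imperfect alignment with pixel-level centroids.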