FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

Huy Che, Vinh-Tiep Nguyen

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although models based on contrastive learning enable zero-shot segmentation, they often lose fine spatial precision at the pixel level because of global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often struggle to balance computational cost against the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only (1+1) steps of a pretrained diffusion model. Moreover, instead of running once per class, FA-Seg segments all classes at once. To further enhance segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across the PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extensibility, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.

Paper Structure

This paper contains 29 sections, 9 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Open-vocabulary semantic segmentation results using FA-Seg. We visualize predictions on challenging scenarios, including contextual images, camouflaged objects, and internet-collected images. The highlighted results demonstrate that, in addition to accurate segmentation of both stuff and things, our method effectively handles camouflaged objects, distinguishes between different characters or dog breeds, and differentiates objects based on color variations.
  • Figure 2: Overview of the proposed FA-Seg. The input consists of both the original and horizontally flipped images, which are first reconstructed via DDIM inversion to obtain their latent representations. In parallel, a text prompt $\mathcal{P}$ and a class prompt $\mathcal{P'}$ are generated to guide inversion and extract class-specific cross-attention maps corresponding to candidate classes (e.g., bus, motorbike, sheep). Cross-attention and self-attention maps at multiple resolutions are then extracted, transformed, and fused via weighted aggregation, where the self-attention maps are used to refine the class-specific cross-attention maps. Final segmentation masks are derived from the refined per-class score maps. Test-Time Flipping (TTF) is applied during inference to enhance robustness: attention maps from the flipped image are spatially realigned and averaged with those from the original image to improve prediction reliability.
  • Figure 3: Qualitative results for the classes in the class prompt $\mathcal{P}'$. Instead of including only the candidate classes that appear in the image (highlighted with blue bounding boxes), we also include distractor classes (highlighted with red bounding boxes). The results show that while the candidate classes present in the image are visualized accurately, the distractor classes yield incorrect or noisy attention maps.
  • Figure 4: Effect of the number of DDIM inversion steps on reconstruction quality, where $I_0$ denotes the original input image. Increasing the number of inversion steps improves reconstruction fidelity.
  • Figure 4: mIoU results for thresholds from 0.4 to 0.65 on the COCO, Context, and VOC datasets.
  • ...and 6 more figures
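
The Test-Time Flipping step described in the Figure 2 caption (maps from the flipped image are spatially realigned and averaged with those from the original image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `ttf_fuse` and `scores_to_mask`, the `(C, H, W)` array layout, and the use of a single background threshold are all assumptions made for illustration.

```python
import numpy as np


def ttf_fuse(maps_orig: np.ndarray, maps_flip: np.ndarray) -> np.ndarray:
    """Average per-class maps from an image and its horizontal flip.

    Both inputs are assumed to have shape (C, H, W). `maps_flip` is assumed
    to have been computed on the horizontally flipped image, so it is
    mirrored back along the width axis before averaging.
    """
    realigned = maps_flip[:, :, ::-1]  # undo the horizontal flip
    return 0.5 * (maps_orig + realigned)


def scores_to_mask(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn (C, H, W) score maps into an (H, W) label map.

    Hypothetical convention: label 0 is background, labels 1..C are the
    candidate classes. Pixels whose best score is below `threshold` are
    assigned to background.
    """
    labels = scores.argmax(axis=0) + 1
    labels[scores.max(axis=0) < threshold] = 0
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maps = rng.random((3, 4, 6))        # toy maps for 3 candidate classes
    flipped_maps = maps[:, :, ::-1]     # what a perfectly symmetric extractor would return
    fused = ttf_fuse(maps, flipped_maps)
    print(scores_to_mask(fused, threshold=0.5).shape)
```

In this toy setup the flipped maps are an exact mirror of the originals, so the fused result equals the original maps; with a real model the two passes differ slightly, and averaging them is what is meant to improve spatial consistency.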