FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation

Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

TL;DR

Open-vocabulary semantic segmentation hinges on fine-grained pixel-text alignment beyond image-level VLM pretraining. FGAseg introduces a Pixel-Level Alignment Module (P2Tformer) and a Text-Pixel Alignment Loss (T2Ploss) to convert coarse vision-text alignment into precise pixel-level semantics, while the Global and Local Category Supplementation (GCS/LCS) module provides boundary cues through pseudo-masks. The approach, validated on COCO-Stuff, ADE20K, and open-setup benchmarks with ViT backbones, consistently improves mIoU over state-of-the-art methods and scales without external data; it also enables inference acceleration via a streamlined decoder and TopK class selection. Together, these components deliver robust open-vocabulary segmentation with sharp boundaries and efficient inference, advancing practical deployment in diverse visual domains.

Abstract

Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo-masks during forward propagation and propose a Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.
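
To make the idea of pseudo-masks derived from cosine and convolutional similarity more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names, the tensor layout, and the choice to model the text kernel as the class embedding averaged over a k x k window are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def global_pseudo_masks(pixel_feats, text_feats):
    """Cosine similarity between every pixel and every class embedding.

    pixel_feats: (H, W, D) per-pixel features; text_feats: (C, D) class embeddings.
    Returns a (C, H, W) array, one coarse mask per class.
    """
    p = l2_normalize(np.asarray(pixel_feats, dtype=float))
    t = l2_normalize(np.asarray(text_feats, dtype=float))
    return np.einsum("hwd,cd->chw", p, t)

def local_pseudo_masks(pixel_feats, text_feats, k=3):
    """Slide each class embedding over the feature map as a k x k kernel.

    Assumption: the kernel repeats the text embedding at all k*k positions
    and averages, with zero padding at the border (k is expected to be odd).
    """
    pixel_feats = np.asarray(pixel_feats, dtype=float)
    text_feats = np.asarray(text_feats, dtype=float)
    H, W, _ = pixel_feats.shape
    pad = k // 2
    padded = np.pad(pixel_feats, ((pad, pad), (pad, pad), (0, 0)))
    # Sum the k*k shifted copies of the map, then average: a box filter.
    acc = np.zeros_like(pixel_feats)
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + H, dx:dx + W]
    local_mean = acc / (k * k)
    return np.einsum("hwd,cd->chw", local_mean, text_feats)

# Toy example: left half of the map matches class 0, right half class 1.
feats = np.zeros((4, 4, 2))
feats[:, :2, 0] = 1.0
feats[:, 2:, 1] = 1.0
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
global_masks = global_pseudo_masks(feats, texts)
local_masks = local_pseudo_masks(feats, texts, k=3)
```

In this toy setting the global masks respond uniformly within each half, while the local masks soften near the boundary between the two halves, which is the kind of neighborhood-aware cue the abstract attributes to the convolutional similarity.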

Paper Structure (32 sections, 13 equations, 7 figures, 7 tables)

This paper contains 32 sections, 13 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of Image-Level Pretraining and Pixel-Level Alignment. (a) Image-Level Pretraining aligns image and text embeddings via contrastive learning, while (b) Pixel-Level Alignment incorporates a pixel-level transformer, alignment loss and category information supplement to achieve finer-grained alignment, bridging the gap for open-vocabulary segmentation (OVS).
  • Figure 2: Overall architecture of FGAseg. (a) Pixel-Level Alignment Module: The P2Tformer aligns tokens in a pixel-text manner, while the T2Ploss enforces precise text-pixel alignment through local alignment. (b) Global Category Supplementation (GCS) and Local Category Supplementation (LCS) provide category boundary information as pseudo-masks for guidance. (c) Global and Local Category Supplementation Propagation incorporates GCS and LCS as pseudo-masks into pixel-level classification.
  • Figure 3: Pixel-Level Alignment Module. This module refines pixel-text alignment using multi-head attention and MLP layers across multiple P2Tformer layers. Vision and text tokens are processed to capture cross-modal correlations, and the result is scaled by a learnable parameter $\gamma$ to balance the modalities. Aligned text tokens slide over vision tokens, creating alignment matrices from which T2Ploss computes the alignment loss ($\mathcal{L}_{\text{align}}$). This fine-grained pixel-text alignment improves segmentation precision by helping the model capture category boundaries.
  • Figure 4: Illustration of (a) Global Category Supplementation (GCS) and (b) Local Category Supplementation (LCS). (a) Global category information is obtained by computing the cosine similarity; (b) local category information is obtained by applying a text convolutional kernel that slides over the image features.
  • Figure 5: Decoder details. (a) During training, the local similarity matrix is processed through concatenation, a linear layer, and a convolutional layer, with an auxiliary loss component. (b) In inference, TopK selection is applied to the local similarity matrix, followed by similar layers as in training.
  • ...and 2 more figures
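
The TopK selection mentioned in the Figure 5 caption can be sketched as follows. This is an illustrative NumPy example rather than the paper's decoder: the per-class scoring rule (peak response over all pixels) and the function names `select_topk_classes` and `predict_labels` are assumptions. The captions only state that TopK selection is applied to the local similarity matrix at inference.

```python
import numpy as np

def select_topk_classes(sim, k):
    """Keep the k classes with the strongest response in the similarity map.

    sim: (C, H, W) similarity between each class and each pixel.
    Returns (keep, reduced): kept class indices in ascending order and the
    (k, H, W) similarity restricted to those classes.
    """
    num_classes = sim.shape[0]
    k = min(k, num_classes)
    # Assumed scoring rule: a class is as relevant as its best-matching pixel.
    scores = sim.reshape(num_classes, -1).max(axis=1)
    keep = np.sort(np.argsort(scores)[-k:])
    return keep, sim[keep]

def predict_labels(sim, k):
    """Per-pixel argmax over the kept classes, mapped back to original ids."""
    keep, reduced = select_topk_classes(sim, k)
    return keep[reduced.argmax(axis=0)]

# Toy example with 4 candidate classes on a 2 x 2 map.
sim = np.stack([
    np.full((2, 2), 0.1),                  # class 0: weak everywhere
    np.array([[0.9, 0.0], [0.0, 0.0]]),    # class 1: strong at one pixel
    np.full((2, 2), 0.5),                  # class 2: moderate everywhere
    np.full((2, 2), -1.0),                 # class 3: absent
])
keep, reduced = select_topk_classes(sim, k=2)
labels = predict_labels(sim, k=2)
```

Restricting the later layers to the kept classes shrinks the class dimension from C to K, which is one plausible reading of how TopK selection contributes to the inference acceleration described in the TL;DR.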