EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

Hongwei Niu, Jie Hu, Jianghang Lin, Guannan Jiang, Shengchuan Zhang

TL;DR

This work addresses the efficiency bottlenecks of open-vocabulary panoptic segmentation by introducing EOV-Seg, a single-stage, shared framework that leverages a Vocabulary-Aware Selection (VAS) module and Two-way Dynamic Embedding Experts (TDEE). By combining a lightweight, spatially aware decoder with CLIP-based backbones and prompt-based text embeddings, EOV-Seg delivers competitive panoptic and semantic performance with significantly improved inference speed (e.g., 11.6 FPS on ADE20K, and 23.8 FPS with a ResNet50 backbone on an RTX 3090) and a small parameter footprint. Ablation studies validate the necessity of VAS and TDEE, showing notable gains over baselines and confirming that spatial awareness is crucial for efficient open-vocabulary segmentation. Overall, the approach offers a practical, scalable path toward real-time open-vocabulary panoptic understanding in diverse scenes.

Abstract

Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both approaches incur substantial computational overhead, which hinders the efficiency of model inference. To fill this gap in efficiency, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce Two-way Dynamic Embedding Experts (TDEE), which efficiently utilize the spatial awareness capabilities of the ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework designed for efficiency; it runs faster than state-of-the-art methods while achieving competitive performance. With COCO training only, EOV-Seg achieves 24.5 PQ, 32.1 mIoU, and 11.6 FPS on the ADE20K dataset, and its inference is 4-19 times faster than state-of-the-art methods. Notably, equipped with a ResNet50 backbone, EOV-Seg runs at 23.8 FPS with only 71M parameters on a single RTX 3090 GPU. Code is available at https://github.com/nhw649/EOV-Seg.

Paper Structure

This paper contains 19 sections, 7 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of three macro-frameworks for open-vocabulary panoptic segmentation: (a) a two-stage framework that is inefficient, non-shared, and lacks spatial awareness; (b) a single-stage framework that is inefficient, shared, and lacks spatial awareness; (c) our proposed framework; (d) FPS versus PQ on the ADE20K dataset for open-vocabulary panoptic segmentation; (e) FPS versus mIoU on the ADE20K dataset for open-vocabulary semantic segmentation.
  • Figure 2: Visualization of Grad-CAM [jacobgilpytorch-cam] and K-means clustering of CNN-based CLIP [Radford2021CLIP] backbone features (ConvNeXt-L). (a) Input image; (b) ground truth; (c) Grad-CAM shows the features focusing on local instances; (d) K-means clustering of the features produces semantically meaningful clusters.
  • Figure 3: Visualization of K-means clustering of backbone features concerning different blocks across various VFMs.
  • Figure 4: Overview of EOV-Seg. First, the initial block of the ViT-based CLIP is used as a spatial awareness extractor to obtain spatial awareness features $F_s$. Then, the visual-semantic aggregated features $\hat{F}_{agg}$ generated by the VAS module, the masks generated by the lightweight decoder, and the spatial awareness features $F_s$ are fed sequentially into mask pooling and TDEE to obtain instance embeddings $\hat{E}_I$, which are compared with the text embeddings $E_t$ by cosine similarity for classification.
  • Figure 5: Visualization of similarity map in VAS.
  • ...and 6 more figures
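
The classification path described in the Figure 4 caption (mask pooling followed by cosine similarity against text embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the TDEE step is omitted entirely, the function names `mask_pool` and `cosine_classify` are hypothetical, and the toy tensors stand in for real CLIP features and text embeddings.

```python
import numpy as np


def mask_pool(features, masks, eps=1e-6):
    """Average the per-pixel features that fall inside each (soft) mask.

    features: array of shape (C, H, W), a dense feature map.
    masks:    array of shape (N, H, W), values in [0, 1].
    Returns an (N, C) array with one pooled embedding per mask.
    """
    c, h, w = features.shape
    flat_feat = features.reshape(c, h * w)             # (C, HW)
    flat_mask = masks.reshape(masks.shape[0], h * w)   # (N, HW)
    summed = flat_mask @ flat_feat.T                   # (N, C) weighted sums
    area = flat_mask.sum(axis=1, keepdims=True)        # (N, 1) mask areas
    return summed / (area + eps)


def cosine_classify(instance_emb, text_emb, eps=1e-6):
    """Assign each instance embedding to the most similar text embedding.

    instance_emb: (N, C) pooled embeddings (stand-in for E_I in the caption).
    text_emb:     (K, C) class-name embeddings (stand-in for E_t).
    Returns (labels, similarity) with shapes (N,) and (N, K).
    """
    a = instance_emb / (np.linalg.norm(instance_emb, axis=1, keepdims=True) + eps)
    b = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + eps)
    similarity = a @ b.T
    return similarity.argmax(axis=1), similarity


# Toy 2x2 image with 2 feature channels: the left column carries channel 0,
# the right column carries channel 1.
features = np.array([[[1.0, 0.0], [1.0, 0.0]],
                     [[0.0, 1.0], [0.0, 1.0]]])
# Two masks: one covering the left column, one covering the right column.
masks = np.array([[[1.0, 0.0], [1.0, 0.0]],
                  [[0.0, 1.0], [0.0, 1.0]]])
# Two hypothetical class embeddings aligned with the two channels.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])

pooled = mask_pool(features, masks)
labels, similarity = cosine_classify(pooled, text_emb)
print(labels.tolist())
```

In this toy setup each mask pools a feature vector aligned with exactly one class embedding, so the left mask is assigned class 0 and the right mask class 1. In the pipeline the caption describes, the pooled embeddings would additionally pass through TDEE before the similarity step.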