Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model

Yi-Chia Chen, Wei-Hua Li, Chu-Song Chen

TL;DR

The paper tackles open-vocabulary panoptic segmentation, where region-level labels may be unseen during training. It introduces OMTSeg, a BEiT-3-based multiway transformer framework that uses cross-modal attention between vision and language, augmented by a visual adapter and language prompting, to fuse visual and linguistic cues for dense segmentation. A Mask2Former-style multiway segmentation head and cosine-similarity-based open-vocabulary classification enable labeling of both seen and unseen categories. Experiments on COCO Panoptic, ADE20K, Pascal Context, and Pascal VOC show competitive or state-of-the-art performance with a smaller model footprint than some baselines, underscoring the benefit of cross-modal integration for open-vocabulary segmentation.
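
To make the cosine-similarity classification step concrete, here is a minimal sketch of how a mask embedding could be matched against text embeddings of class names. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings, and the `temperature` parameter are all assumptions for illustration, and the summary does not specify OMTSeg's exact formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_masks(mask_embeddings, text_embeddings, class_names, temperature=0.07):
    """Assign each mask the class whose text embedding is most similar.

    `temperature` is a hypothetical scaling factor (an assumption, not a
    value taken from the paper) that sharpens the softmax over similarities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    results = []
    for mask_vec in mask_embeddings:
        logits = [cosine_similarity(mask_vec, t) / temperature for t in text_embeddings]
        # Numerically stable softmax over the class logits.
        max_logit = max(logits)
        exps = [math.exp(l - max_logit) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy data: "zebra" stands in for a category unseen during training.
class_names = ["cat", "dog", "zebra"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_masks(mask_embeddings, text_embeddings, class_names)
for label, prob in predictions:
    print(label, round(prob, 3))
```

Because classification reduces to comparing embeddings, adding a new category in this sketch only requires appending its name and text embedding; no classifier weights are retrained, which is the property that makes unseen labels addressable.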

Abstract

Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model, BEiT-3, and leverage the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.

Paper Structure

This paper contains 18 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of OMTSeg, which contains a BEiT-3 fusion encoder and a multiway segmentation head.
  • Figure 2: Overall architecture of the visual adapter. (a) The BEiT-3 encoder extracts features and integrates visual and linguistic features. (b) The visual adapter is constructed by stacking SPM, SFI, and MSFE. (c) SPM. (d) SFI. (e) MSFE.