Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model
Yi-Chia Chen, Wei-Hua Li, Chu-Song Chen
TL;DR
The paper tackles open-vocabulary panoptic segmentation, where region-level labels may be unseen during training. It introduces OMTSeg, a BEiT-3-based multiway transformer framework that uses cross-modal attention between vision and language, augmented by a visual adapter and language prompting, to fuse visual and linguistic cues for dense segmentation. A Mask2Former-style multiway segmentation head and cosine-similarity-based open-vocabulary classification enable labeling of both seen and unseen categories. Experiments on COCO Panoptic, ADE20K, Pascal Context, and Pascal VOC show competitive or state-of-the-art performance with a smaller model footprint than some baselines, underscoring the benefit of cross-modal integration for open-vocabulary segmentation.
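To make the cosine-similarity classification step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each predicted mask already has an embedding vector and each candidate class name already has a text embedding of the same dimension; the function names (`l2_normalize`, `cosine_similarity`, `classify_masks`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_masks(mask_embeddings, text_embeddings, class_names):
    """Assign each mask the class whose text embedding is most similar.

    Hypothetical helper: because the class list is supplied at inference
    time, it may include names that never appeared during training.
    """
    labels = []
    for mask_vec in mask_embeddings:
        scores = [cosine_similarity(mask_vec, t) for t in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((class_names[best], scores[best]))
    return labels


# Toy 2-D embeddings, purely illustrative.
mask_embeddings = [[1.0, 0.0], [0.0, 2.0]]
text_embeddings = [[2.0, 0.1], [0.1, 3.0]]
class_names = ["cat", "zebra"]

for name, score in classify_masks(mask_embeddings, text_embeddings, class_names):
    print(name, round(score, 3))
```

Because the label set is only a list of text embeddings, extending the vocabulary in this sketch amounts to appending another class name and its embedding; no classifier weights are tied to a fixed set of categories.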
Abstract
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods rely on large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation, which builds on another large-scale vision-language pre-trained model, BEiT-3, and leverages the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
