LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier

Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang

TL;DR

This work tackles large vocabulary semantic segmentation by using image-level classification data as coarse supervision for pixel-level learning. It introduces LarvSeg, a framework that combines a simple joint-training baseline with a category-wise attentive classifier; the classifier uses a memory bank to direct supervision to the regions of each category, so the model can segment novel categories for which it saw only image-level labels. Experiments on COCO-Stuff, ADE, and ImageNet21K show substantial gains in novel-category mIoU, including up to 6.0 mIoU on A150 and 2.1 mIoU on A847, and the authors demonstrate a 21K-category segmentation model. The results suggest that balanced classification data and targeted region-level supervision are key to scaling semantic segmentation to very large vocabularies without extensive mask annotation.

Abstract

Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at https://github.com/HaojunYu1998/large_voc_seg.
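
The baseline idea in the abstract, supervising a pixel-level model with an image-level label, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `image_level_loss` is hypothetical, and the choice of average pooling over pixels followed by softmax cross-entropy is an assumption, since the abstract does not specify how pixel predictions are aggregated.

```python
import math


def image_level_loss(pixel_logits, image_label):
    """Cross-entropy between pooled pixel logits and one image-level label.

    pixel_logits: list of per-pixel logit vectors, shape [num_pixels][num_classes].
    image_label:  index of the category attached to the whole image.
    Assumption: pixel logits are average-pooled into a single image-level vector.
    """
    num_pixels = len(pixel_logits)
    num_classes = len(pixel_logits[0])

    # Average-pool the per-pixel logits into one logit per category.
    pooled = [
        sum(pixel[c] for pixel in pixel_logits) / num_pixels
        for c in range(num_classes)
    ]

    # Numerically stable log-sum-exp for the softmax normaliser.
    peak = max(pooled)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in pooled))

    # Negative log-likelihood of the image-level label.
    return log_z - pooled[image_label]


# Two pixels, three categories; both pixels favour category 0.
example_logits = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
loss_correct = image_level_loss(example_logits, 0)
loss_wrong = image_level_loss(example_logits, 1)
```

In this toy case the loss is lower when the image label matches the category the pixels favour. Because the gradient of such a pooled loss reaches every pixel equally, it gives no signal about where the category is, which is the gap the category-wise attentive classifier is meant to close.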

Paper Structure (15 sections, 6 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of a new paradigm to address large vocabulary semantic segmentation with image classification data.
  • Figure 2: Visualization of the response maps for pixel grouping. The model is trained on C171 and the response maps are visualized on A150. The dots with different colours are the selected pixels for different categories. (a) shows the image overlaid with the ground truth mask; (b) and (c) are the response maps of wall and window, which are inside the training vocabulary; (d), (e) and (f) are the response maps of sofa, radiator and painting, which are outside the training vocabulary. We observe that (d), (e) and (f) present intra-category compactness as good as that of (b) and (c).
  • Figure 3: Illustration of the LarvSeg framework. The meaning of each icon is listed on the left. CA-Classifier and CA-Map stand for the category-wise attentive classifier and the category-wise attention map defined in Section \ref{def:ca_map}. The proposed simple baseline learns from segmentation and image classification data simultaneously via pixel-level and image-level classification tasks (the losses are denoted as $\mathcal{L}_{\text{seg}}$ and $\mathcal{L}_{\text{cls}}$ in the figure). Additionally, the proposed category-wise attentive classifier maintains category-wise features in a memory bank to highlight the foreground pixel group and suppress background pixel groups. The attentively pooled score map is supervised by an auxiliary image-level classification task (the loss is denoted as $\mathcal{L}_{\text{aux}}$ in the figure).
  • Figure 4: Visualization of model predictions. The tags show the model names and the corresponding mIoU on this image. Circles with different colours mark regions of novel categories in the image: sofa (red circle), radiator (dark blue circle) and painting (light blue circle).
  • Figure 5: Visualization of 21K-category semantic segmentation.
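
The Figure 3 caption describes a memory bank of category-wise features that highlights foreground pixels before an image-level score is pooled. A rough sketch of that mechanism is given below. Everything specific in it is an assumption made for illustration: the helper names (`attentive_pool`, `update_memory`), the use of cosine similarity, the softmax with a `temperature` parameter, and the momentum update of the memory bank are not taken from the paper, whose exact definitions are not part of this page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def attentive_pool(pixel_feats, pixel_scores, memory_feat, temperature=0.1):
    """Pool per-pixel scores of one category with attention from its memory feature.

    pixel_feats:  list of per-pixel feature vectors.
    pixel_scores: per-pixel classification scores for the same category.
    memory_feat:  stored feature for that category (assumed memory-bank entry).
    """
    # Pixels similar to the stored category feature get large attention logits.
    sims = [cosine(f, memory_feat) / temperature for f in pixel_feats]

    # Softmax over pixels: foreground is highlighted, background suppressed.
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Attention-weighted image-level score for this category.
    pooled = sum(a * s for a, s in zip(attn, pixel_scores))
    return pooled, attn


def update_memory(memory_feat, new_feat, momentum=0.9):
    """Momentum update of a memory-bank entry (assumed update rule)."""
    return [momentum * m + (1 - momentum) * n for m, n in zip(memory_feat, new_feat)]


# Two foreground-like pixels and one background-like pixel.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = [5.0, 5.0, -5.0]
pooled_score, attention = attentive_pool(feats, scores, memory_feat=[1.0, 0.0])
```

In this example the background pixel receives almost no attention, so the pooled score stays close to the foreground score instead of being dragged down as it would be under plain averaging. An auxiliary image-level loss applied to `pooled_score` would then mostly affect the highlighted region.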