Applying ViT in Generalized Few-shot Semantic Segmentation

Liyuan Geng; Jinhong Xia; Yuanhe Guo

TL;DR

The paper investigates applying Vision Transformer (ViT) backbones to Generalized Few-shot Semantic Segmentation (GFSS), comparing ResNet and ViT-based models with various decoders. Using base training on base classes and inference with augmented decoders, it reveals that DINOv2-backed models paired with a Linear Classifier often outperform ResNet-based setups on PASCAL-$5^i$, while Mask Transformer decoders are prone to overfitting in GFSS. The study highlights the strong few-shot learning capability of large ViT models, particularly DINOv2, and identifies a caveat: pure ViT-based encoders with large decoders may overfit, requiring careful architectural choices. Overall, the work demonstrates the potential of ViT-based GFSS systems and provides practical guidance on backbone-decoder combinations for better base-novel class generalization.

Abstract

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, paired with three decoders: a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model combined with a large-scale ViT decoder overfits easily.
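
As a reading aid for the 116% figure: if it is taken as a relative gain (an assumption, since this page states neither the metric nor the baseline value), it relates the two scores as follows, where $m_{\text{DINOv2}}$ and $m_{\text{ResNet}}$ are hypothetical symbols for the one-shot scores of the DINOv2 + linear classifier model and the best ResNet-based model:

$$
\frac{m_{\text{DINOv2}} - m_{\text{ResNet}}}{m_{\text{ResNet}}} = 1.16
\quad\Longrightarrow\quad
m_{\text{DINOv2}} = 2.16\, m_{\text{ResNet}}
$$

Under that reading, the ViT-based score is a little more than twice the ResNet-based one.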

Paper Structure (27 sections, 1 equation, 3 figures, 1 table, 1 algorithm)

Introduction
Backbone architecture.
Loss Function.
Contributions.
Related Work
Few-shot segmentation.
Generalized Few-shot segmentation.
Visual Transformer.
Visual Transformer in Segmentation.
Method
Base Training
ResNet
Vision Transformer
Inference
Aligning support labels and predictions.
...and 12 more sections

Figures (3)

Figure 1: Overview of our structure for generalized few-shot segmentation. The upper half shows the process of aligning support labels and predictions, while the lower half shows the mask prediction process for the query image. Decoupling the encoder from the decoder lets us experiment with different model combinations: ResNet34+Linear Classifier, ResNet34+UPerNet, ResNet50+Linear Classifier, ResNet50+UPerNet, DINO+Linear Classifier, DINO+Mask Transformer, DINOv2+Linear Classifier, DINOv2+Mask Transformer.
Figure 2: Qualitative results of 1-shot inference on PASCAL-$5^0$ for different encoder-decoder architectures. Models learn the novel classes airplane, bicycle, bird, boat, and bottle from support sets, and then predict both base and novel classes.
Figure 3: Example results of 5-shot inference on PASCAL-$5^0$ using ResNet34 with a linear classifier. The model mistakes background pixels for novel classes.
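
The encoder-decoder split described in the Figure 1 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a ViT backbone such as DINOv2 (it just emits one feature vector per image patch), the weights are hand-picked rather than trained, and nearest-neighbour upsampling is an assumption made for simplicity. What it does show is the idea of a linear classifier decoder: every patch feature is scored independently by one linear layer, and the per-patch labels are expanded back to pixel resolution.

```python
from typing import List

Features = List[List[List[float]]]  # patch rows x patch cols x feature dim
Labels = List[List[int]]


def toy_encoder(image: List[List[float]], patch_size: int = 2) -> Features:
    """Hypothetical backbone stand-in: one feature vector per patch.

    Each feature is [mean intensity, 1 - mean intensity] of the patch.
    """
    rows, cols = len(image) // patch_size, len(image[0]) // patch_size
    feats: Features = []
    for py in range(rows):
        feat_row = []
        for px in range(cols):
            pixels = [
                image[py * patch_size + dy][px * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            mean = sum(pixels) / len(pixels)
            feat_row.append([mean, 1.0 - mean])
        feats.append(feat_row)
    return feats


def linear_decoder(feats: Features, weights: List[List[float]],
                   biases: List[float]) -> Labels:
    """Score each patch with logits = W f + b and keep the argmax class."""
    labels: Labels = []
    for feat_row in feats:
        label_row = []
        for f in feat_row:
            logits = [
                sum(w_i * f_i for w_i, f_i in zip(w, f)) + b
                for w, b in zip(weights, biases)
            ]
            label_row.append(max(range(len(logits)), key=logits.__getitem__))
        labels.append(label_row)
    return labels


def upsample_nearest(labels: Labels, patch_size: int) -> Labels:
    """Expand patch-level labels to a pixel-level mask (nearest neighbour)."""
    height = len(labels) * patch_size
    width = len(labels[0]) * patch_size
    return [
        [labels[y // patch_size][x // patch_size] for x in range(width)]
        for y in range(height)
    ]


# Hand-picked weights: class 0 (background) favours dark patches,
# class 1 (object) favours bright patches.
WEIGHTS = [[0.0, 1.0], [1.0, 0.0]]
BIASES = [0.0, 0.0]

image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
patch_labels = linear_decoder(toy_encoder(image), WEIGHTS, BIASES)
mask = upsample_nearest(patch_labels, patch_size=2)
```

Because the decoder only sees the encoder's output grid, swapping the backbone or the decoder changes one function without touching the other, which is the property that makes the listed model combinations easy to compare.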
