Applying ViT in Generalized Few-shot Semantic Segmentation

Liyuan Geng, Jinhong Xia, Yuanhe Guo

TL;DR

The paper investigates applying Vision Transformer (ViT) backbones to Generalized Few-shot Semantic Segmentation (GFSS), comparing ResNet and ViT-based models with various decoders. Using base training on base classes and inference with augmented decoders, it reveals that DINOv2-backed models paired with a Linear Classifier often outperform ResNet-based setups on PASCAL-$5^i$, while Mask Transformer decoders are prone to overfitting in GFSS. The study highlights the strong few-shot learning capability of large ViT models, particularly DINOv2, and identifies a caveat: pure ViT-based encoders with large decoders may overfit, requiring careful architectural choices. Overall, the work demonstrates the potential of ViT-based GFSS systems and provides practical guidance on backbone-decoder combinations for better base-novel class generalization.

Abstract

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The architecture combining DINOv2 with a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, substantially outperforming the best ResNet-based architecture by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model combined with a large-scale ViT decoder overfits easily.

Paper Structure (27 sections, 1 equation, 3 figures, 1 table, 1 algorithm)

Figures (3)

  • Figure 1: Overview of our structure for generalized few-shot segmentation. The upper half shows the process of aligning support labels and predictions, while the lower half represents the mask prediction process for the query image. The disentanglement between Encoder and Decoder enables us to experiment with different model combinations: ResNet34+Linear Classifier, ResNet34+UPerNet, ResNet50+Linear Classifier, ResNet50+UPerNet, DINO+Linear Classifier, DINO+Mask Transformer, DINOv2+Linear Classifier, DINOv2+Mask Transformer.
  • Figure 2: Qualitative results of 1-shot inference on PASCAL-$5^0$ for different encoder-decoder architectures. Models learn the novel classes airplane, bicycle, bird, boat, and bottle from support sets, then predict both base and novel classes.
  • Figure 3: Example results of 5-shot inference on PASCAL-$5^0$ using ResNet34 with a linear classifier. The model mistakes background pixels for novel classes.
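
The encoder-decoder disentanglement described in the Figure 1 caption can be illustrated with a minimal sketch. The code below is not the authors' implementation: the names `FAMILIES`, `combinations`, and `linear_classify` are hypothetical, the grouping of decoders by backbone family is inferred only from the pairings the caption lists, and the linear classifier is a generic per-patch argmax over linear scores, assumed here purely for illustration.

```python
from itertools import product

# Hypothetical grouping: each backbone family is paired only with the
# decoders that the Figure 1 caption lists for it.
FAMILIES = {
    "resnet": (["ResNet34", "ResNet50"], ["Linear Classifier", "UPerNet"]),
    "vit": (["DINO", "DINOv2"], ["Linear Classifier", "Mask Transformer"]),
}


def combinations():
    """Return every encoder+decoder pairing named in the caption."""
    return [
        f"{enc}+{dec}"
        for encoders, decoders in FAMILIES.values()
        for enc, dec in product(encoders, decoders)
    ]


def linear_classify(patch_features, weights, biases):
    """Label each patch with the class whose linear score is highest.

    patch_features: list of feature vectors, one per image patch.
    weights: one weight vector per class; biases: one scalar per class.
    """
    labels = []
    for feat in patch_features:
        scores = [
            sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(weights, biases)
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


if __name__ == "__main__":
    for name in combinations():
        print(name)
```

Because the decoder only consumes per-patch feature vectors, any encoder that produces such features can be swapped in without changing `linear_classify`, which is the practical benefit of keeping the two stages separate.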