Applying ViT in Generalized Few-shot Semantic Segmentation
Liyuan Geng, Jinhong Xia, Yuanhe Guo
TL;DR
The paper investigates applying Vision Transformer (ViT) backbones to Generalized Few-shot Semantic Segmentation (GFSS), comparing ResNet-based and ViT-based models paired with various decoders. With training on base classes and inference with augmented decoders, the experiments show that DINOv2-based models paired with a linear classifier often outperform ResNet-based setups on PASCAL-$5^i$, while Mask Transformer decoders are prone to overfitting in GFSS. The study highlights the strong few-shot learning capability of large pretrained ViT models, particularly DINOv2, and identifies a caveat: pure ViT-based encoders combined with large ViT decoders overfit easily, so the backbone-decoder combination must be chosen with care. Overall, the work demonstrates the potential of ViT-based GFSS systems and offers practical guidance on backbone-decoder combinations for better generalization across base and novel classes.
Abstract
This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbones, including ResNets and pretrained Vision Transformer (ViT)-based models, and decoders, including a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvements on testing benchmarks. A potential caveat, however, is that combining a pure ViT-based backbone with a large-scale ViT decoder makes the model prone to overfitting.
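
To make the best-performing combination concrete, the following is a minimal sketch of a linear classifier decoder applied to ViT patch tokens. It is an illustration under stated assumptions, not the authors' implementation: random features stand in for DINOv2 outputs, the function name `linear_segmentation_head` is hypothetical, nearest-neighbour upsampling is chosen for simplicity, and extending the classifier to novel classes by appending weight rows is only one plausible reading of an "augmented" decoder. The sizes (a 16 x 16 patch grid, 384-dimensional tokens, 16 base outputs including background, 5 novel classes) are likewise assumed for the example.

```python
import numpy as np

def linear_segmentation_head(patch_tokens, weight, bias, grid_hw, out_hw):
    """Classify each patch token, then upsample the prediction to image size.

    patch_tokens: (N, D) array, one feature vector per patch (N = gh * gw)
    weight:       (C, D) array, one row per class
    bias:         (C,) array
    grid_hw:      (gh, gw) patch grid size
    out_hw:       (H, W) output resolution
    """
    gh, gw = grid_hw
    H, W = out_hw
    logits = patch_tokens @ weight.T + bias        # (N, C) per-patch class scores
    logits = logits.reshape(gh, gw, -1)            # (gh, gw, C)
    # Nearest-neighbour upsampling (an assumption made for brevity).
    rows = np.arange(H) * gh // H
    cols = np.arange(W) * gw // W
    upsampled = logits[rows][:, cols]              # (H, W, C)
    return upsampled.argmax(axis=-1)               # (H, W) label map

rng = np.random.default_rng(0)
gh, gw, dim = 16, 16, 384                          # assumed grid and token size
tokens = rng.standard_normal((gh * gw, dim))       # stand-in for backbone features

# Base classifier (assumed: background + 15 base classes).
base_w = rng.standard_normal((16, dim)) * 0.02
base_b = np.zeros(16)
# Novel-class rows (assumed: 5 novel classes), appended to the base classifier.
novel_w = rng.standard_normal((5, dim)) * 0.02
novel_b = np.zeros(5)

weight = np.concatenate([base_w, novel_w], axis=0)  # (21, dim)
bias = np.concatenate([base_b, novel_b], axis=0)    # (21,)

pred = linear_segmentation_head(tokens, weight, bias, (gh, gw), (224, 224))
print(pred.shape)
```

Because the decoder has only one weight row per class, it adds very few parameters on top of the backbone, which is consistent with the overfitting caveat raised for larger decoders.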
