SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning

Hongjun Wang, Sagar Vaze, Kai Han

TL;DR

SPTNet addresses Generalized Category Discovery (GCD) by alternating between learning data parameters, through Spatial Prompt Tuning, and fine-tuning model parameters in a two-stage, EM-inspired framework. It introduces per-patch pixel-space prompts and a global prompt that adapt the input images to a pre-trained Vision Transformer, so that the resulting representations transfer better to unseen categories. The approach achieves state-of-the-art results across seven datasets, notably 61.4% average accuracy on the Semantic Shift Benchmark (SSB), while adding extra parameters amounting to only 0.117% of those in the backbone. The work presents spatial prompts as a form of learned augmentation and as an effective alternative to relying on model fine-tuning alone for open-world recognition.

Abstract

Generalized Category Discovery (GCD) aims to classify unlabelled images from both 'seen' and 'unseen' classes by transferring knowledge from a set of labelled 'seen' class images. A key theme in existing GCD approaches is adapting large-scale pre-trained models for the GCD task. An alternative perspective, however, is to adapt the data representation itself for better alignment with the pre-trained model. As such, in this paper, we introduce a two-stage adaptation approach termed SPTNet, which iteratively optimizes model parameters (i.e., model fine-tuning) and data parameters (i.e., prompt learning). Furthermore, we propose a novel spatial prompt tuning method (SPT) which considers the spatial property of image data, enabling the method to better focus on object parts, which can transfer between seen and unseen classes. We thoroughly evaluate our SPTNet on standard benchmarks and demonstrate that our method outperforms existing GCD methods. Notably, we find our method achieves an average accuracy of 61.4% on the SSB, surpassing prior state-of-the-art methods by approximately 10%. The improvement is particularly remarkable as our method yields extra parameters amounting to only 0.117% of those in the backbone architecture. Project page: https://visual-ai.github.io/sptnet.
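The alternation between data parameters and model parameters described in the abstract can be illustrated on a toy problem. The sketch below is not the SPTNet training procedure: it fits a scalar model weight `w` and a scalar input offset `p` (standing in for a prompt) to a 1-D regression target, updating only one of the two at a time. The function name `alternating_fit`, the switching period `k`, and the squared-error loss are all assumptions made for illustration.

```python
import numpy as np

def alternating_fit(x, y, steps=200, k=5, lr=0.05):
    """Toy alternating optimization of pred = w * (x + p).

    p plays the role of a data parameter (a prompt added to the input),
    w plays the role of a model parameter. Every k steps the parameter
    being updated switches; the other one stays frozen.
    """
    w, p = 1.0, 0.0
    losses = []
    for step in range(steps):
        err = w * (x + p) - y
        losses.append(float(np.mean(err ** 2)))
        if (step // k) % 2 == 0:
            # Stage one (hypothetical): update the data parameter, model frozen.
            p -= lr * float(np.mean(2.0 * err * w))
        else:
            # Stage two (hypothetical): update the model parameter, prompt frozen.
            w -= lr * float(np.mean(2.0 * err * (x + p)))
    losses.append(float(np.mean((w * (x + p) - y) ** 2)))
    return w, p, losses

# Usage: the target was generated with w = 2 and p = 0.5.
x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * (x + 0.5)
w, p, losses = alternating_fit(x, y)
print(w, p, losses[0], losses[-1])
```

In this toy setting each stage only ever sees gradients for its own parameter, which is the structural point of the two-stage scheme; the real method operates on image prompts and Transformer weights rather than scalars.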

Paper Structure (25 sections, 8 equations, 22 figures, 13 tables)

Figures (22)

  • Figure 1: The overall framework of SPTNet. SPTNet alternates between data parameter tuning (stage one) and model parameter tuning (stage two). The data parameters are learnable prompts, for which we introduce spatial prompts $P_s$. The model parameters include the parameters of the top layer of the Transformer backbone $\mathcal{F}$ and a projection head $\mathcal{H}$.
  • Figure 2: (a) An example of applying Spatial Prompt Tuning (SPT) to an image with a height $H$ and width $W$. For each image patch $x^{j}$ with a height $h$ and width $w$, we attach spatial prompts $P_s$ of size $m$ to it. (b) Joint spatial and global prompts for SPTNet.
  • Figure 3: Effects of different choices of alternating frequency (a) and prompt size (b) on SSB (i.e., CUB, Stanford Cars and FGVC-Aircraft). We report the averaged results and show the influence on 'All', 'Old' and 'New' classes.
  • Figure 4: t-SNE visualization of representations on CIFAR-10. SPTNet produces the most discriminative representations among all compared methods.
  • Figure 5: Attention visualization of different heads (numbered as $h_1$ to $h_{12}$). The top 10% attended patches are shown in red.
  • ...and 17 more figures
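
To make the Figure 2(a) caption concrete, here is a minimal NumPy sketch of attaching a pixel-space prompt of size $m$ to every $h \times w$ patch of an $H \times W$ image. It assumes the prompt is added to the $m$-pixel frame of each patch and leaves the patch interior untouched; whether the actual method adds, pads, or shares prompts across patches is not stated in this summary, so treat that choice, along with the names `border_mask` and `apply_spatial_prompts`, as hypothetical.

```python
import numpy as np

def border_mask(h, w, m):
    """Boolean (h, w) mask that is True on the m-pixel frame of a patch."""
    mask = np.ones((h, w), dtype=bool)
    mask[m:h - m, m:w - m] = False  # clear the interior, keep the frame
    return mask

def apply_spatial_prompts(image, prompts, h, w, m):
    """Add a prompt frame to every h x w patch of an H x W x C image.

    prompts has shape (H // h, W // w, h, w, C); only the entries lying on
    each patch's m-pixel frame are used (assumed additive prompting).
    """
    H, W, _ = image.shape
    assert H % h == 0 and W % w == 0, "image must tile evenly into patches"
    mask = border_mask(h, w, m)[:, :, None]  # broadcast over channels
    out = image.astype(float)                # astype returns a copy
    for i in range(H // h):
        for j in range(W // w):
            rows = slice(i * h, (i + 1) * h)
            cols = slice(j * w, (j + 1) * w)
            out[rows, cols] += np.where(mask, prompts[i, j], 0.0)
    return out

# Usage with small made-up sizes: an 8 x 8 RGB image, 4 x 4 patches, m = 1.
image = np.zeros((8, 8, 3))
prompts = np.ones((2, 2, 4, 4, 3))
prompted = apply_spatial_prompts(image, prompts, h=4, w=4, m=1)
print(prompted[:, :, 0])
```

Under this assumption each patch contributes $(hw - (h - 2m)(w - 2m)) \cdot C$ effective prompt values for $C$ channels, which is why small $m$ keeps the number of added parameters low.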