SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting

Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang

TL;DR

SDVPT tackles open-world object counting by addressing the generalization gap to unseen categories through semantic-driven visual prompt tuning. It introduces CSPI to initialize category-specific prompts and TGPR to transfer text-embedding topology into visual prompts, followed by inference-time synthesis of prompts for unseen classes based on semantic similarity. The framework is plug-and-play, improving multiple base counting models across FSC-147, CARPK, and PUCPR+ with modest overhead, and achieving new state-of-the-art results on several benchmarks. By preserving vision-language alignment and explicitly modeling topological relations, SDVPT enhances zero-shot counting while remaining efficient for real-world deployment.

Abstract

Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). CSPI generates category-specific visual prompts, and TGPR then distills latent structural patterns from the VLM's text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
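
The inference-time step described in the abstract can be pictured as a similarity-weighted mixture of learned prompts. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes cosine similarity between text embeddings, a softmax with a hypothetical temperature `tau`, and a plain weighted sum. The names `synthesize_prompt`, `train_text_emb`, and `train_prompts` are illustrative, and the toy vectors stand in for real VLM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, tau=1.0):
    """Numerically stable softmax; tau is an assumed temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_prompt(unseen_text_emb, train_text_emb, train_prompts, tau=0.1):
    """Mix training-category prompts, weighted by text-embedding similarity.

    Assumption: the weights are a softmax over cosine similarities.
    The aggregation rule actually used by SDVPT may differ.
    """
    names = list(train_text_emb)
    sims = [cosine(unseen_text_emb, train_text_emb[n]) for n in names]
    weights = softmax(sims, tau)
    dim = len(train_prompts[names[0]])
    prompt = [0.0] * dim
    for w, name in zip(weights, names):
        for i, value in enumerate(train_prompts[name]):
            prompt[i] += w * value
    return prompt, dict(zip(names, weights))


# Toy data (made up): 2-D "text embeddings" and 3-D "visual prompts".
train_text_emb = {"apple": [1.0, 0.0], "car": [0.0, 1.0]}
train_prompts = {"apple": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
unseen = [0.9, 0.1]  # an unseen category lying close to "apple"

prompt, weights = synthesize_prompt(unseen, train_text_emb, train_prompts)
```

In this toy setup the unseen category lies close to "apple" in the text space, so the synthesized prompt is dominated by the "apple" prompt. A lower `tau` sharpens the mixture toward the nearest training category, and a higher `tau` flattens it.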

Paper Structure

This paper contains 19 sections, 12 equations, 10 figures, 7 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the open-world object counting pipeline and limitations of naive fine-tuning strategies. (a) Open-world object counting depends on text-image alignment to enable user interaction and decoding. (b) Naive fine-tuning and visual prompt tuning strategies neglect text-image alignment for unseen categories, resulting in inaccurate predictions during testing.
  • Figure 2: Main architecture of the proposed method. Our method comprises CSPI and TGPR modules. The CSPI module (described in the paper's CSPI section) trains a set of category-specific visual prompts, while the TGPR module (described in the paper's TGPR section) transfers the topological structure of text embeddings onto them. For inference, we employ an aggregation strategy similar to TGPR to harness the topological structure of text embeddings spanning unseen and training categories, thereby extending knowledge from the training set to unseen categories.
  • Figure 3: Illustration of visual prompt integration. (a) For ViT, we embed the prompt between the $cls$ token and the image embedding. (b) For Swin Transformer, we insert the visual prompt before Window Multi-Head Self-Attention (W-MSA) and Shifted Window Multi-Head Self-Attention (SW-MSA), removing it during the patch merging stage.
  • Figure 4: Illustration of the learning objectives for CSPI and TGPR. (a) CSPI enforces alignment between text and visual embeddings within categories, disregarding the topological structure among visual embeddings. (b) TGPR harmonizes the topological structure of visual embeddings with that of text embeddings.
  • Figure 5: Joint embedding space of VPT and SDVPT on training and test sets, obtained by dimensionality reduction using Linear Discriminant Analysis (LDA).
  • ...and 5 more figures
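
The ViT case in Figure 3(a), where the prompt sits between the $cls$ token and the image embedding, amounts to a simple sequence splice. The sketch below is illustrative only and assumes a token sequence stored as a Python list whose first element is the $cls$ token. The function names `insert_prompt` and `remove_prompt` are hypothetical. The removal helper is included only to show the inverse operation on a flat sequence and is not a model of the Swin patch-merging step in Figure 3(b).

```python
def insert_prompt(tokens, prompt_tokens):
    """Return [cls] + prompt tokens + patch tokens.

    Assumes tokens[0] is the cls token and tokens[1:] are patch embeddings.
    """
    if not tokens:
        raise ValueError("token sequence must contain at least the cls token")
    cls_token, patch_tokens = tokens[0], tokens[1:]
    return [cls_token] + list(prompt_tokens) + list(patch_tokens)


def remove_prompt(tokens, num_prompt):
    """Drop num_prompt tokens that were inserted right after the cls token."""
    if num_prompt < 0 or num_prompt > len(tokens) - 1:
        raise ValueError("num_prompt is out of range for this sequence")
    return [tokens[0]] + tokens[1 + num_prompt:]


# Toy sequence (made up): one cls token and two patch tokens, 1-D each.
sequence = [[0.0], [1.0], [2.0]]
visual_prompt = [[9.0], [8.0]]  # two prompt tokens for one category

with_prompt = insert_prompt(sequence, visual_prompt)
restored = remove_prompt(with_prompt, len(visual_prompt))
```

Here `with_prompt` has five tokens, with the two prompt tokens directly after the $cls$ token, and `restored` equals the original three-token sequence.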