Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

Xianing Chen, Si Huo, Borui Jiang, Hailin Hu, Xinghao Chen

TL;DR

This work tackles single-domain generalization in few-shot counting by showing that prototypes learned from a narrow source distribution hinder cross-domain performance. It introduces Universal Representation Matching (URM), which distills universal vision-language representations from CLIP into object prototypes and uses cross-attention to build a robust correlation map for density regression. By incorporating both universal vision and language representations, URM achieves state-of-the-art results on FSC-147 and FSCD-LVIS in cross-domain and zero-shot settings while maintaining in-domain performance. The approach also uses language prompts generated by large vision-language models (LVLMs) to enable training without predefined category names, broadening applicability in open-world scenarios.

Abstract

Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This problem falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve it, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them with image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations, distilled from a large-scale pretrained vision-language model, into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain setting and the newly introduced domain generalization setting.
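
The standard extract-then-match pipeline described in the abstract can be illustrated with a minimal sketch. This is a toy illustration rather than the authors' implementation: the function names (`average_pool_box`, `correlation_map`), the box format, and the use of average pooling with cosine similarity are all assumptions chosen for brevity. URM itself matches prototypes to image features through cross-attention and then regresses a density map, which this sketch does not reproduce.

```python
import math

def average_pool_box(feature_map, box):
    """Extract step: average the feature vectors inside an exemplar box.

    box = (top, left, bottom, right), end-exclusive (hypothetical format).
    """
    top, left, bottom, right = box
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for row in range(top, bottom):
        for col in range(left, right):
            for c in range(channels):
                total[c] += feature_map[row][col][c]
            count += 1
    return [value / count for value in total]

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, with eps guarding against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def correlation_map(feature_map, prototypes):
    """Match step: at each location, keep the best similarity over all prototypes."""
    return [
        [max(cosine_similarity(vec, proto) for proto in prototypes) for vec in row]
        for row in feature_map
    ]

# Toy 3x3 feature map with 2 channels; the "target objects" sit at the corners.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
]
exemplar_boxes = [(0, 0, 1, 1)]  # one annotated exemplar covering the top-left cell

prototypes = [average_pool_box(feature_map, box) for box in exemplar_boxes]
corr = correlation_map(feature_map, prototypes)

# Locations that respond strongly to the prototype (threshold is arbitrary).
matches = sum(1 for row in corr for value in row if value > 0.5)
```

In this sketch the prototype depends entirely on features from the training distribution, which is the weakness the abstract points to: if the feature extractor has only seen a narrow source domain, prototypes for unseen categories may not separate well.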

Paper Structure

This paper contains 36 sections, 6 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Illustration of (a) domain generalization for few-shot counting, (b) the vanilla extract-then-match pipeline, and (c) our proposed universal representation matching.
  • Figure 2: t-SNE visualization of the prototype feature space for different categories from FSC-147 by (a) the vanilla paradigm trained on FSC-147, (b) the vanilla paradigm trained on FSCD-LVIS, and (c) our proposed URM trained on FSCD-LVIS. Note that the visualization is conducted on the test set, where the object categories are disjoint from those in the training set. Best viewed in color.
  • Figure 3: The framework of our proposed URM. The inference architecture is depicted in the gray part, where the learned prototypes are matched with the image feature through cross attention. The yellow part illustrates the universal V-L representations obtained from CLIP, which are distilled into the prototypes exclusively during the training phase.
  • Figure 4: Illustration of the prompt encoder.
  • Figure 5: Visualization of the segmentation.