Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

Hongduan Tian, Feng Liu, Zhanke Zhou, Tongliang Liu, Chengqi Zhang, Bo Han

TL;DR

This paper proposes contrastive prototype-image adaptation (CoPA), a simple yet effective method that, in the manner of CLIP, adapts separate transformations for prototypes and images by treating prototypes as text prompts. Analyses indicate that CoPA learns better representation clusters, enlarges the gap between prototype and image representations, and achieves minimal validation loss at the enlarged gap.

Abstract

In cross-domain few-shot classification (CFC), recent works mainly focus on adapting a simple transformation head on top of a frozen pre-trained backbone with a few labeled data to project embeddings into a task-specific metric space, where classification is performed by measuring similarities between image instance and prototype representations. Such a framework implicitly assumes that prototype and image instance embeddings share the same representation transformation. However, in this paper, we find that a gap, which resembles the modality gap, naturally exists between the prototype and image instance embeddings extracted from the frozen pre-trained backbone. Simply applying the same transformation during the adaptation phase constrains the exploration of optimal representations and shrinks the gap between prototype and image representations. To solve this problem, we propose a simple yet effective method, contrastive prototype-image adaptation (CoPA), which adapts separate transformations for prototypes and images in the manner of CLIP by treating prototypes as text prompts. Extensive experiments on Meta-Dataset demonstrate that CoPA achieves state-of-the-art performance more efficiently. Further analyses also indicate that CoPA can learn better representation clusters, enlarge the gap, and achieve minimal validation loss at the enlarged gap.
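
The pipeline described in the abstract can be sketched in a few lines. The following is a minimal, forward-pass-only illustration under stated assumptions, not the authors' implementation: the helper names (`class_prototypes`, `cosine_logits`, `cross_entropy`), the 2-D toy embeddings, the temperature of 0.1, and the hand-picked matrix `W_PROTO` are all hypothetical. It computes class prototypes as mean support embeddings, passes prototypes and images through either one shared linear head or two separate heads, and scores them with a temperature-scaled cosine-similarity cross-entropy. Only the image-to-prototype direction of a CLIP-style loss is shown, and no parameters are learned.

```python
import math

def matvec(W, x):
    """Apply a linear head W (given as a list of rows) to a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(x):
    """Scale x to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def class_prototypes(embeddings, labels):
    """Return the sorted class labels and the mean embedding of each class."""
    classes = sorted(set(labels))
    protos = []
    for c in classes:
        members = [e for e, y in zip(embeddings, labels) if y == c]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    return classes, protos

def cosine_logits(embeddings, protos, W_img, W_proto, temperature=0.1):
    """Cosine similarity of every transformed image to every transformed prototype."""
    z_img = [l2_normalize(matvec(W_img, e)) for e in embeddings]
    z_pro = [l2_normalize(matvec(W_proto, p)) for p in protos]
    return [[sum(a * b for a, b in zip(zi, zp)) / temperature for zp in z_pro]
            for zi in z_img]

def cross_entropy(logits, targets):
    """Mean softmax cross-entropy, using the log-sum-exp trick for stability."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[t]
    return total / len(logits)

# Hypothetical stand-ins for frozen-backbone embeddings of a 2-way support set.
EMBEDDINGS = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
LABELS = [0, 0, 1, 1]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
W_PROTO = [[1.0, -0.5], [-0.5, 1.0]]  # hand-picked for illustration, not learned

classes, protos = class_prototypes(EMBEDDINGS, LABELS)
targets = [classes.index(y) for y in LABELS]

# Shared head: prototypes and images go through the same transformation.
logits_shared = cosine_logits(EMBEDDINGS, protos, IDENTITY, IDENTITY)
# Separate heads: prototypes get their own transformation.
logits_separate = cosine_logits(EMBEDDINGS, protos, IDENTITY, W_PROTO)

loss_shared = cross_entropy(logits_shared, targets)
loss_separate = cross_entropy(logits_separate, targets)
predictions = [row.index(max(row)) for row in logits_separate]
print(loss_shared, loss_separate, predictions)
```

In an actual adaptation phase both heads would be optimized by gradient descent on the support set. The matrices are fixed here purely to show where the shared-transformation assumption enters the computation.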

Paper Structure

This paper contains 47 sections, 4 theorems, 22 equations, 27 figures, 10 tables, 1 algorithm.

Key Result

Theorem 3.1

Let the measure $d(\cdot, \cdot)$ be the cosine similarity function. Given a set of normalized finite support data representations $\mathcal{Z}=\{(\boldsymbol{z}_i, y_i)\}_{i=1}^{n}$, where $\|\boldsymbol{z}\|_2=1$ for all $\boldsymbol{z}\in \mathcal{Z}$ and $N_C$ classes are included, then we have … where $\boldsymbol{z}^{\prime}$ is an independent copy of samples in $\mathcal{Z}$, $\mathcal{C}_{c

Figures (27)

  • Figure 1: There naturally exists a gap between prototype and image instance embeddings, but applying the same transformation shrinks such a gap. Fig.(a) shows that there naturally exists a gap, which resembles the "modality" gap in visual language models, between prototype and image instance embeddings extracted from a frozen pre-trained backbone. However, Fig.(b) shows that the gap between the representations of prototypes and image instances is shrunk after applying the same representation transformation to both the image instance and prototype embeddings.
  • Figure 2: The upper subfigure shows the URL pipeline which applies the same transformation to both prototype and image instance embeddings. The bottom subfigure shows the pipeline of CoPA, which tries to adapt two different representation transformation heads respectively for prototypes and image instances in the way of CLIP via substituting text prompts with prototype embeddings.
  • Figure 3: (a). The global minimum validation loss is achieved when the "modality" gap is enlarged. Fig. (a) depicts the validation loss landscape w.r.t the changes of the "modality" gap between prototype and image instance embeddings. The validation loss fails to achieve the global minimum at the original gap, and the global minimum can be achieved when the gap is enlarged. (b)-(c). The shared representation transformation fails to learn compact instance representation clusters. According to the visualization results of both prototype and image instance embeddings and their representations obtained with the same representation transformation, compared to the prototype and image instance embeddings extracted from the frozen pre-trained backbone (Fig. (b)), the shared transformation fails to learn image instance representations which are well clustered (Fig. (c)).
  • Figure 4: The change of the scale of the upper bound of URL representation gaps during the adaptation.
  • Figure 5: (a). The gap between prototype and image instance representations is enlarged from 0.22 to 1.38 by CoPA. Such a phenomenon is consistent with that demonstrated by mindgap. (b). The clusters of image instance representations learned from CoPA. The more compact clusters reveal that CoPA learns better instance representations. (c). The validation loss achieves its global minimum at the gap learned by CoPA, which indicates that CoPA can improve the generalization performance.
  • ...and 22 more figures
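
The gap values quoted in the captions (for example, 0.22 enlarged to 1.38) can be made concrete with a small sketch. This excerpt does not define how the gap is measured, so the code below assumes the convention commonly used for the modality gap: the Euclidean distance between the centroid of the L2-normalized image instance embeddings and the centroid of the L2-normalized prototype embeddings. The function name `centroid_gap` and the toy vectors are hypothetical.

```python
import math

def l2_normalize(x):
    """Scale x to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else list(x)

def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def centroid_gap(image_embs, proto_embs):
    """Assumed gap measure: distance between the two normalized centroids."""
    img_center = centroid([l2_normalize(v) for v in image_embs])
    pro_center = centroid([l2_normalize(v) for v in proto_embs])
    return math.dist(img_center, pro_center)

# Toy case: one class whose prototype is the mean of its two images.
images = [[1.0, 0.0], [0.0, 1.0]]
prototype = centroid(images)            # [0.5, 0.5] before normalization
gap = centroid_gap(images, [prototype])
print(gap)
```

In this toy case the gap is non-zero even though the prototype is exactly the mean of the two images, because the normalized prototype lies on the unit sphere while the centroid of the normalized images lies inside it.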

Theorems & Definitions (8)

  • Theorem 3.1
  • Theorem 3.2: The shared transformation
  • Theorem 5.1
  • proof
  • Lemma G.1
  • proof
  • proof
  • proof