Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang

TL;DR

Proto-CLIP addresses the challenge of few-shot object recognition by fusing vision and language through prototypical networks. It builds image and text prototypes from CLIP embeddings, adapts them with learnable memories and adapters, and enforces cross-modal alignment with InfoNCE losses. The approach yields training-free and fine-tuned variants that outperform many CLIP-based baselines on diverse benchmarks and demonstrate practical viability in a robot perception workflow. This work highlights the benefit of joint image–text representations and prototype alignment for robust, data-efficient learning in real-world robotics.
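One way to picture the prototype idea is the NumPy sketch below. It averages per-class embeddings into image and text prototypes, then classifies a query by mixing the two prototype-based distributions. Several details are assumptions made for illustration and are not stated in the summary above: the softmax over negative squared distances, the sharpness `beta`, the mixing weight `alpha`, and all function names. Random vectors stand in for CLIP embeddings, and the learnable memories and adapters are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def class_prototypes(embeddings, labels, num_classes):
    """Average the normalized embeddings of each class: (num_classes, dim)."""
    embeddings = l2_normalize(embeddings)
    return np.stack([embeddings[labels == k].mean(axis=0)
                     for k in range(num_classes)])

def prototype_probs(queries, prototypes, beta):
    """Softmax over negative squared distances to the prototypes (assumed form)."""
    diff = queries[:, None, :] - prototypes[None, :, :]
    logits = -beta * (diff ** 2).sum(axis=-1)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def fused_probs(queries, image_protos, text_protos, alpha=0.5, beta=10.0):
    """Convex combination of the image- and text-prototype distributions."""
    queries = l2_normalize(queries)
    return (alpha * prototype_probs(queries, image_protos, beta)
            + (1.0 - alpha) * prototype_probs(queries, text_protos, beta))

# Synthetic stand-ins for CLIP embeddings (hypothetical data, not real features).
rng = np.random.default_rng(0)
num_classes, shots, prompts, dim = 3, 4, 2, 8
centers = np.eye(num_classes, dim)

def noisy_samples(per_class):
    """Draw `per_class` noisy points around each class center."""
    labels = np.repeat(np.arange(num_classes), per_class)
    return centers[labels] + 0.05 * rng.normal(size=(labels.size, dim)), labels

image_emb, image_labels = noisy_samples(shots)    # few-shot support images
text_emb, text_labels = noisy_samples(prompts)    # several prompts per class
queries, query_labels = noisy_samples(1)          # one query per class

image_protos = class_prototypes(image_emb, image_labels, num_classes)
text_protos = class_prototypes(text_emb, text_labels, num_classes)
probs = fused_probs(queries, image_protos, text_protos)
predictions = probs.argmax(axis=1)
```

In this sketch, setting `alpha` to 1 would rely on image prototypes alone and setting it to 0 on text prototypes alone; intermediate values let both modalities contribute to the prediction.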

Abstract

We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of the corresponding classes. Such alignment is beneficial for few-shot classification due to the reinforced contributions from both types of prototypes. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method by conducting experiments on benchmark datasets for few-shot learning, as well as in the real world for robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP
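The alignment step can be illustrated with a symmetric InfoNCE-style loss between image and text prototypes, sketched below in NumPy. The abstract does not give the loss formula, so the cosine-similarity logits, the `temperature` value, the two-direction sum, and the function names are assumptions rather than the authors' exact formulation. The prototypes are synthetic.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchors, targets, temperature=0.07):
    """InfoNCE where row k of `anchors` should match row k of `targets`."""
    logits = l2_normalize(anchors) @ l2_normalize(targets).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for class k sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

def alignment_loss(image_protos, text_protos, temperature=0.07):
    """Sum of image-to-text and text-to-image InfoNCE terms (assumed form)."""
    return (info_nce(image_protos, text_protos, temperature)
            + info_nce(text_protos, image_protos, temperature))

# Synthetic prototypes (hypothetical data): one row per class.
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
text_protos = np.eye(num_classes, dim)
aligned_image_protos = text_protos + 0.05 * rng.normal(size=(num_classes, dim))
# Shifting rows by one pairs every image prototype with the wrong class.
shuffled_image_protos = np.roll(aligned_image_protos, shift=1, axis=0)

aligned = alignment_loss(aligned_image_protos, text_protos)
misaligned = alignment_loss(shuffled_image_protos, text_protos)
```

Here `aligned` comes out smaller than `misaligned`, which is the behavior an alignment objective of this kind is meant to reward: each class's image prototype is pulled toward its own text prototype and pushed away from those of other classes.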

Paper Structure

This paper contains 8 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Our Proto-CLIP model learns a joint embedding space of images and text, where image and text prototypes formed using support sets are learned and aligned for few-shot classification.
  • Figure 2: Overview of our proposed Proto-CLIP model. The image memory, the text memory, and the adapter network are learned. Given a class name, $\tau_i$ returns the $i$-th of $\tilde{K}$ predefined text prompts.
  • Figure 3: Two designs of the adapters. (a) A multi-layer perceptron-based adapter as in [gao2021clip]. (b) A convolution-based adapter that we introduce. The feature dimensions shown are for the CLIP ResNet50 backbone.
  • Figure 4: Barnes-Hut t-SNE visualization [van2014accelerating] using the FewSOL dataset [p2023fewsol]. (a) Image and text prototypes from zero-shot CLIP, which are not aligned. (b) Aligned image and text prototypes from Proto-CLIP-$F$.
  • Figure 5: Results for the real-world setup with top-5 predictions from the Proto-CLIP-$F$ (ViT-L/14) model trained on FewSOL-198 [p2023fewsol]. Speech-to-text is performed via Whisper [radford2022_whisper].