Transductive Zero-Shot and Few-Shot CLIP

Ségolène Martin, Yunshi Huang, Fereshteh Shakeri, Jean-Christophe Pesquet, Ismail Ben Ayed

TL;DR

This work extends CLIP with a transductive framework for zero-shot and few-shot classification by modeling the text-vision probability features on the unit simplex with class-wise Dirichlet distributions. It introduces the EM-Dirichlet algorithm, a block Majorization-Minimization approach that jointly updates the Dirichlet parameters and the per-sample assignments, and that includes an MDL-based partition penalty and a soft-assignment update. The method yields substantial gains across 11 datasets, including a nearly 20% absolute improvement on ImageNet zero-shot with batches of 75 queries, and competitive gains in the few-shot regime, where it outperforms several transductive baselines. The work establishes theoretical and empirical connections to EM, demonstrates the importance of Dirichlet-based simplex modeling for transductive vision-language inference, and suggests future extensions to segmentation and out-of-distribution detection.

Abstract

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We first construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.
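To make the notion of vision-text probability features concrete, here is a minimal sketch, not the authors' implementation. It assumes each feature is a softmax over temperature-scaled cosine similarities between one image embedding and the K class-prompt embeddings. The function names, the toy embeddings, and the temperature value are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(logits):
    """Numerically stable softmax; the output lies on the unit simplex."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vision_text_probabilities(image_embedding, text_embeddings, temperature=100.0):
    """Map one image embedding to a K-dimensional probability vector.

    Assumption: the temperature plays the role of a CLIP-style logit scale;
    the value 100.0 is illustrative, not taken from the paper.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(temperature * cosine)
    return softmax(logits)


# Toy example with K = 3 hypothetical class prompts in a 3-dimensional space.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.9, 0.3, 0.1]
probs = vision_text_probabilities(image_embedding, text_embeddings, temperature=10.0)
```

Every query in a batch yields one such vector, so the batch becomes a set of points on the simplex of $\mathbb{R}^K$, which is the space where the class-wise Dirichlet distributions are fitted.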

Paper Structure (34 sections, 3 theorems, 34 equations, 6 figures, 6 tables)

Key Result

Lemma 1

Let $\varphi = \ln \Gamma( \cdot +1)$. Then, for any $\boldsymbol{\beta}_k = (\beta_{k, i})_{1\leq i \leq K} \in (0, +\infty)^K$, the function $q(\,\cdot\,; \boldsymbol{\beta}_k)$ defined as, for every $\boldsymbol{\alpha}_k\in (0, +\infty)^K$, is a tangent majorant of $F_k$ at $\boldsymbol{\beta}_k$, where the function $c$ is defined by
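For reference, the generic Majorization-Minimization meaning of "tangent majorant" can be written out as below. This restates the standard definition using the lemma's symbols; it is not the paper's specific expression for $q$ or $c$, and the iterate notation $\boldsymbol{\alpha}_k^{(\ell)}$ is introduced here only for illustration.

```latex
% q(. ; beta_k) is a tangent majorant of F_k at beta_k when:
\begin{align*}
q(\boldsymbol{\alpha}_k; \boldsymbol{\beta}_k) &\geq F_k(\boldsymbol{\alpha}_k)
  \quad \text{for every } \boldsymbol{\alpha}_k \in (0, +\infty)^K, \\
q(\boldsymbol{\beta}_k; \boldsymbol{\beta}_k) &= F_k(\boldsymbol{\beta}_k).
\end{align*}
% Minimizing the majorant built at the current iterate cannot increase F_k:
\begin{align*}
\boldsymbol{\alpha}_k^{(\ell+1)} &\in \operatorname*{argmin}_{\boldsymbol{\alpha}_k}\,
  q\big(\boldsymbol{\alpha}_k; \boldsymbol{\alpha}_k^{(\ell)}\big), \\
F_k\big(\boldsymbol{\alpha}_k^{(\ell+1)}\big)
  &\leq q\big(\boldsymbol{\alpha}_k^{(\ell+1)}; \boldsymbol{\alpha}_k^{(\ell)}\big)
  \leq q\big(\boldsymbol{\alpha}_k^{(\ell)}; \boldsymbol{\alpha}_k^{(\ell)}\big)
  = F_k\big(\boldsymbol{\alpha}_k^{(\ell)}\big).
\end{align*}
```

The chain of inequalities is what gives an MM scheme its monotone decrease of the objective.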

Figures (6)

  • Figure 1: Given a transductive few-shot task, both visual and textual information are extracted from the images and class-wise prompts. The embeddings are next combined into vision-text probability vectors. Classification is carried out on the simplex set of $\mathbb{R}^K$ using the labels of the support set. An empty support set corresponds to the zero-shot scenario, which is akin to a clustering problem.
  • Figure 2: Examples of Dirichlet distributions on the simplex of $\mathbb{R}^3$, for $\boldsymbol{\alpha}=(10, 5.0, 5.0)$ (left) and $\boldsymbol{\alpha}=(0.975, 0.975, 3.0)$ (right)
  • Figure 3: Illustration of the bipartite matching for class assignment.
  • Figure 4: Average accuracy on the 11 datasets as a function of the number of samples in the query set, over 1,000 tasks generated following the protocol described in Section \ref{sec:zero_shot_generation}. As anticipated, the efficiency of transduction increases with the number of samples in the query set.
  • Figure 5: Accuracy versus shots for seven methods from Table \ref{table:few-shot} on SUN397, ImageNet, and the average across the 11 datasets.
  • ...and 1 more figure
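The two parameter settings in the Figure 2 caption can be explored numerically with the standard Dirichlet log-density. The sketch below uses only the textbook density formula; the function name and the evaluation points are illustrative choices, not taken from the paper.

```python
import math


def dirichlet_log_pdf(z, alpha):
    """Log-density of Dirichlet(alpha) at a point z in the open simplex.

    log p(z) = ln Gamma(sum_i alpha_i) - sum_i ln Gamma(alpha_i)
               + sum_i (alpha_i - 1) * ln z_i
    """
    if len(z) != len(alpha):
        raise ValueError("z and alpha must have the same length")
    log_normalizer = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_normalizer + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, z))


# Parameters quoted in the Figure 2 caption.
alpha_left = (10.0, 5.0, 5.0)
alpha_right = (0.975, 0.975, 3.0)

center = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)

# When every alpha_i > 1, the mode is (alpha_i - 1) / (sum(alpha) - K).
k = len(alpha_left)
mode_left = tuple((a - 1.0) / (sum(alpha_left) - k) for a in alpha_left)

# Illustrative point close to the third vertex of the simplex.
near_third_vertex = (0.05, 0.05, 0.9)

left_at_mode = dirichlet_log_pdf(mode_left, alpha_left)
left_at_center = dirichlet_log_pdf(center, alpha_left)
right_near_vertex = dirichlet_log_pdf(near_third_vertex, alpha_right)
right_at_center = dirichlet_log_pdf(center, alpha_right)
```

With the left parameters the density peaks in the interior of the simplex, shifted toward the first coordinate. With the right parameters, two entries of $\boldsymbol{\alpha}$ are below 1, so the density is higher near the third vertex than at the center.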

Theorems & Definitions (5)

  • Lemma 1: Majorant of the negative log-likelihood
  • Proposition 1
  • Lemma 2: \cite{erdogan2002monotonic}
  • proof
  • proof