Transductive Zero-Shot and Few-Shot CLIP

Ségolène Martin; Yunshi Huang; Fereshteh Shakeri; Jean-Christophe Pesquet; Ismail Ben Ayed

TL;DR

This work extends CLIP with a transductive framework for zero-shot and few-shot classification by modeling vision-text probability features on the unit simplex with class-wise Dirichlet distributions. It introduces the EM-Dirichlet algorithm, a block Majorization-Minimization approach that jointly updates Dirichlet parameters and per-sample assignments, and includes a minimum description length (MDL) partition penalty and a soft-assignment update. The method yields substantial gains across 11 datasets, including a nearly 20% absolute improvement in ImageNet zero-shot accuracy with batches of 75 queries, and outperforms several transductive baselines in the few-shot regime. The work establishes theoretical and empirical connections to EM, demonstrates the importance of Dirichlet-based simplex modeling for transductive vision-language inference, and suggests future extensions to segmentation and out-of-distribution detection.
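
To make "vision-text probability features on the unit simplex" concrete, here is a minimal sketch of one common way to build such a feature: a softmax over scaled cosine similarities between an image embedding and one text embedding per class. The function names, the toy 3-dimensional embeddings, and the temperature value of 100 are assumptions chosen for illustration, and the paper's exact construction may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vision_text_probabilities(image_emb, text_embs, temperature=100.0):
    """Map one image embedding to a probability vector over K classes.

    Assumption: softmax of temperature-scaled cosine similarities, as in
    standard CLIP zero-shot scoring. The output lies on the unit simplex.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(temperature * sum(a * b for a, b in zip(img, txt)))
    # Subtract the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical): one image, three class prompts.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = vision_text_probabilities(image_emb, text_embs)
```

Each query image then becomes a point in the simplex of $\mathbb{R}^K$, which is the space on which the class-wise Dirichlet distributions are fitted.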

Abstract

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We first construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.
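
As a small illustration of what modeling a class with a Dirichlet law involves, the sketch below evaluates the standard Dirichlet log-density on the simplex and, when every parameter exceeds 1, its mode. The two parameter vectors are the ones quoted in the Figure 2 caption; the helper names and the evaluation points are assumptions made for this example, not code from the authors' repository.

```python
import math


def dirichlet_log_pdf(x, alpha):
    """Log-density of Dirichlet(alpha) at a point x in the open simplex.

    ln p(x) = ln Gamma(sum alpha) - sum ln Gamma(alpha_i)
              + sum (alpha_i - 1) ln x_i
    """
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha); this formula requires every alpha_i > 1."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("interior mode formula needs all alpha_i > 1")
    k = len(alpha)
    total = sum(alpha)
    return [(a - 1.0) / (total - k) for a in alpha]


alpha_left = (10.0, 5.0, 5.0)       # Figure 2, left panel
alpha_right = (0.975, 0.975, 3.0)   # Figure 2, right panel

center = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
mode_left = dirichlet_mode(alpha_left)

# With entries of alpha below 1, the log-density increases as the matching
# coordinate shrinks toward 0 while the other coordinates stay roughly fixed.
edge_close = (1e-6, 0.2, 0.8 - 1e-6)
edge_far = (1e-3, 0.2, 0.799)
log_p_close = dirichlet_log_pdf(edge_close, alpha_right)
log_p_far = dirichlet_log_pdf(edge_far, alpha_right)
```

The contrast between the two parameter vectors shows why the Dirichlet family suits simplex-valued features: it can place mass in the interior or concentrate it near the boundary, depending on whether the parameters lie above or below 1.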

Paper Structure (34 sections, 3 theorems, 34 equations, 6 figures, 6 tables)

Introduction
Related works
Vision-language models
Few-shot classification
Inductive vs. transductive setting
Few-shot CLIP
Proposed method
Computing informative feature vectors
Data distribution
Simplex-based classification criterion
Proposed algorithm
Minimization step w.r.t. Dirichlet parameter
Minimization step w.r.t. assignment variable
Global algorithm and class-assignment
Links with other clustering and transductive few-shot objectives
...and 19 more sections

Key Result

Lemma 1

Let $\varphi = \ln \Gamma( \cdot +1)$. Then, for any $\boldsymbol{\beta}_k = (\beta_{k, i})_{1\leq i \leq K} \in (0, +\infty)^K$, the function $q(\,\cdot\,; \boldsymbol{\beta}_k)$ defined as, for every $\boldsymbol{\alpha}_k\in (0, +\infty)^K$, is a tangent majorant of $F_k$ at $\boldsymbol{\beta}_k$, where the function $c$ is defined by
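
For readers unfamiliar with the term, the tangent-majorant property asserted in Lemma 1 can be written out as the two standard conditions below, followed by the descent chain that a Majorization-Minimization step inherits from them. The iteration index $\ell$ is notation introduced here for illustration, and the explicit forms of $q$ and $c$ are not restated.

```latex
\begin{align*}
  q(\boldsymbol{\alpha}_k; \boldsymbol{\beta}_k) &\geq F_k(\boldsymbol{\alpha}_k)
    \quad \text{for every } \boldsymbol{\alpha}_k \in (0, +\infty)^K, \\
  q(\boldsymbol{\beta}_k; \boldsymbol{\beta}_k) &= F_k(\boldsymbol{\beta}_k).
\end{align*}
% If \alpha_k^{(\ell+1)} minimizes q( . ; \alpha_k^{(\ell)}), then
\begin{equation*}
  F_k\big(\boldsymbol{\alpha}_k^{(\ell+1)}\big)
  \leq q\big(\boldsymbol{\alpha}_k^{(\ell+1)}; \boldsymbol{\alpha}_k^{(\ell)}\big)
  \leq q\big(\boldsymbol{\alpha}_k^{(\ell)}; \boldsymbol{\alpha}_k^{(\ell)}\big)
  = F_k\big(\boldsymbol{\alpha}_k^{(\ell)}\big).
\end{equation*}
```

In words, minimizing the majorant instead of $F_k$ itself can never increase $F_k$, which is what makes a block Majorization-Minimization scheme monotone.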

Figures (6)

Figure 1: Given a transductive few-shot task, both visual and textual information are extracted from the images and class-wise prompts. The embeddings are next combined into vision-text probability vectors. Classification is carried out on the simplex set of $\mathbb{R}^K$ using the labels of the support set. An empty support set corresponds to the zero-shot scenario, which is akin to a clustering problem.
Figure 2: Examples of Dirichlet distributions on the simplex of $\mathbb{R}^3$, for $\boldsymbol{\alpha}=(10, 5.0, 5.0)$ (left) and $\boldsymbol{\alpha}=(0.975, 0.975, 3.0)$ (right)
Figure 3: Illustration of the bipartite matching for class assignment.
Figure 4: Average accuracy on the 11 datasets as a function of the number of samples in the query set, over 1,000 tasks generated following the protocol described in Section \ref{sec:zero_shot_generation}. As anticipated, the efficiency of transduction increases with the number of samples in the query set.
Figure 5: Accuracy versus shots for seven methods from Table \ref{table:few-shot} on SUN397, ImageNet, and the average across the 11 datasets.
...and 1 more figure
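
The Figure 3 caption mentions a bipartite matching for class assignment. As a rough sketch of that idea, the code below finds the one-to-one cluster-to-class matching with the highest total score by exhaustive search over permutations. The score matrix and the function name are hypothetical, and how the paper defines the matching scores is not stated in this summary. For larger numbers of classes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def best_cluster_to_class_matching(score):
    """Return (matching, total) maximizing the summed score.

    score[c][k] is an affinity between cluster c and class k (higher is
    better). matching[c] is the class assigned to cluster c, and each class
    is used exactly once. Exhaustive search, so only suitable for small K.
    """
    n = len(score)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(score[c][perm[c]] for c in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical affinities: clusters 0 and 2 both prefer class 1, so a greedy
# per-cluster argmax would assign the same class twice.
score = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
]
matching, total = best_cluster_to_class_matching(score)
```

The matching resolves the conflict by giving class 1 to only one of the competing clusters, which is the point of using a one-to-one assignment when clusters found in the zero-shot setting carry no labels of their own.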

Theorems & Definitions (5)

Lemma 1: Majorant of the negative log-likelihood
Proposition 1
Lemma 2 [erdogan2002monotonic]
proof
proof
