Transductive Zero-Shot and Few-Shot CLIP
Ségolène Martin, Yunshi Huang, Fereshteh Shakeri, Jean-Christophe Pesquet, Ismail Ben Ayed
TL;DR
This work extends CLIP with a transductive framework for zero-shot and few-shot classification by modeling the text-vision probability features on the unit simplex with class-wise Dirichlet distributions. It introduces the EM-Dirichlet algorithm, a block Majorization-Minimization approach that jointly updates the Dirichlet parameters and the per-sample assignments, and that includes an MDL-based partition penalty and a soft-assignment update. The method yields substantial gains across 11 datasets, including a nearly 20% absolute improvement on ImageNet zero-shot with batches of 75 queries, and competitive gains in the few-shot regime, where it outperforms several transductive baselines. The work establishes theoretical and empirical connections to EM, demonstrates the importance of Dirichlet-based simplex modeling for transductive vision-language inference, and suggests future extensions to segmentation and out-of-distribution detection.
Abstract
Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We first construct informative vision-text probability features, leading to a classification problem on the unit simplex. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: https://github.com/SegoleneMartin/transductive-CLIP.
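To make the two ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the vision-text probability feature of an image is the softmax of temperature-scaled cosine similarities between its embedding and the class text embeddings, and it scores such a feature under per-class Dirichlet densities. The function names, the temperature value, and the toy inputs are all hypothetical; the MDL-based penalty and the Dirichlet parameter updates of the actual algorithm are deliberately omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def simplex_feature(image_emb, text_embs, temperature=0.01):
    """Map one image embedding to a point on the unit simplex.

    Assumption: softmax over cosine similarities with the K class
    text embeddings; the temperature value is illustrative only.
    """
    return softmax([cosine(image_emb, t) / temperature for t in text_embs])

def dirichlet_logpdf(z, alpha, eps=1e-12):
    """Log-density of a Dirichlet(alpha) distribution at simplex point z."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # Clamp z away from zero so that log() stays finite.
    return log_norm + sum(
        (a - 1.0) * math.log(max(x, eps)) for a, x in zip(alpha, z)
    )

def soft_assignments(z, alphas):
    """Posterior-like class weights for z under uniform class priors.

    Simplification: no partition penalty and no parameter update,
    only a normalization of the per-class Dirichlet likelihoods.
    """
    return softmax([dirichlet_logpdf(z, alpha) for alpha in alphas])

# Toy usage with hypothetical 2-D embeddings for three classes.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = simplex_feature([0.9, 0.1], text_embs, temperature=0.5)
alphas = [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 5.0]]
u = soft_assignments(z, alphas)
```

In this sketch each query is treated on its own. In the transductive setting described above, the assignments for the whole batch and the Dirichlet parameters would be re-estimated in alternation, which is where the joint inference across the mini-batch comes from.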
