Boosting Vision-Language Models with Transduction

Maxime Zanella, Benoît Gérin, Ismail Ben Ayed

TL;DR

TransCLIP improves the zero- and few-shot generalization of Vision-Language Models (VLMs) with a transductive framework that jointly leverages unlabeled test data, language knowledge, and data structure. It adds a text-guided KL penalty to a GMM-Laplacian objective and optimizes it with a Block Majorize-Minimize procedure, which yields decoupled, scalable updates of the sample assignments z and closed-form updates of the class means and the diagonal covariance. Empirical results across 11 datasets and multiple backbones show consistent gains over inductive baselines and competitive or superior performance relative to existing transductive methods, with particular strength in transfer and domain-generalization scenarios. The approach is plug-and-play, scales to large models, and benefits from language supervision, which highlights the practical value of incorporating textual knowledge into transductive learning for multimodal systems.

Abstract

Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.
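
To make the alternating structure described above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a simplified form of the updates: a closed-form block for class means and a shared diagonal variance, followed by decoupled assignment updates that combine the Gaussian log-likelihood, a neighbor (Laplacian) term, and a text prior weighted by a hypothetical parameter `lam`. The function name `transductive_sketch`, the top-k cosine affinity, and all default values are assumptions made for illustration.

```python
import numpy as np

def softmax(a, axis=1):
    # Numerically stable softmax along the class axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def transductive_sketch(feats, text_probs, n_neighbors=3, lam=1.0,
                        n_outer=10, n_inner=5):
    """Simplified transductive refinement (illustrative assumption only).

    feats:      (N, D) L2-normalized image features of the unlabeled test set.
    text_probs: (N, K) zero-shot softmax predictions from the text encoder.
    Returns:    (N, K) refined soft assignments z.
    """
    n, _ = text_probs.shape

    # Sparse affinity: keep each sample's nearest neighbors by cosine similarity.
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    w = np.zeros((n, n))
    w[rows, idx.ravel()] = np.maximum(sim[rows, idx.ravel()], 0.0)

    z = text_probs.copy()                          # initialize from the text predictions
    log_prior = lam * np.log(text_probs + 1e-12)   # text-guided term (assumed form)

    for _ in range(n_outer):
        # Closed-form block: class means and a shared diagonal variance.
        mass = z.sum(axis=0) + 1e-12                          # (K,)
        mu = (z.T @ feats) / mass[:, None]                    # (K, D)
        diff2 = (feats[:, None, :] - mu[None, :, :]) ** 2     # (N, K, D)
        var = (z[:, :, None] * diff2).sum(axis=(0, 1)) / n + 1e-6   # (D,)

        # Gaussian log-likelihood, up to a class-independent constant.
        log_p = -0.5 * (diff2 / var).sum(axis=2)              # (N, K)

        # Decoupled assignment updates: every row is updated independently,
        # given the previous assignments of its neighbors.
        for _ in range(n_inner):
            z = softmax(log_prior + log_p + w @ z)
    return z
```

Because each row of `z` is updated independently given the previous iterate, the inner loop parallelizes across samples, which is the property the abstract refers to as decoupled sample-assignment updates. The dense `(N, N)` affinity here is for readability only; a sparse matrix would be the natural choice on large datasets.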

Paper Structure (42 sections, 1 theorem, 10 equations, 1 figure, 23 tables)

This paper contains 42 sections, 1 theorem, 10 equations, 1 figure, 23 tables.

Key Result

Theorem 1

Assume that, for each block, the majorizing function is quasi-convex, and its first-order behavior is the same as the original objective locally. Furthermore, assume that the sub-problem solved for each block has a unique solution. Then, every limit point of the iterates generated by BMM is a coordinate-wise minimum of the objective.
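
As a toy illustration of the block-wise scheme the theorem concerns, the sketch below alternates exact minimization over two scalar blocks of a strictly convex quadratic. This is the special case in which the majorizer of each block is the objective itself restricted to that block, so each sub-problem has a unique solution. The objective `f` and the function `block_minimize` are hypothetical examples and are unrelated to the TransCLIP objective.

```python
def f(x, y):
    # Strictly convex toy objective with unique minimizer (x, y) = (2, -1).
    return x * x + y * y + x * y - 3.0 * x

def block_minimize(x=0.0, y=0.0, n_iters=50):
    """Alternate exact minimization over the blocks x and y."""
    history = [f(x, y)]
    for _ in range(n_iters):
        x = (3.0 - y) / 2.0   # unique argmin over x with y fixed
        y = -x / 2.0          # unique argmin over y with x fixed
        history.append(f(x, y))
    return x, y, history

x_star, y_star, history = block_minimize()
```

Each block update can only lower the objective, so `history` is non-increasing, and the iterates approach the point where neither block can be improved on its own.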

Figures (1)

  • Figure 1: TransCLIP significantly improves the average top-1 accuracy over 11 datasets when used on top of inductive zero-shot CLIP, 2-shot CoOp prompt tuning, and the 2-shot TaskRes adapter, for various encoder sizes.

Theorems & Definitions (1)

  • Theorem 1: Convergence of BMM (Razaviyayn et al.)