Multimodal CLIP Inference for Meta-Few-Shot Image Classification

Constance Ferragu; Philomene Chagniot; Vincent Coyette

TL;DR

The paper investigates whether a multimodal foundation model pretrained on large-scale external data can perform meta-few-shot image classification directly, without additional training. It introduces three inference strategies (textual, visual, and stacked) that exploit CLIP's joint image-text embedding space to discriminate among N classes given k support examples per class, and evaluates them on CIFAR-FS, MiniImageNet, and Meta-Dataset. CLIP-based inference often matches or surpasses state-of-the-art meta-few-shot methods, and the stacked strategy remains robust across datasets and domain shifts. The paper also outlines practical considerations around data overlap and calibration. Overall, training-free CLIP inference is a strong baseline for few-shot learning and a reference point for future multimodal meta-learning research.

Abstract

In recent literature, few-shot classification has predominantly been defined by the N-way k-shot meta-learning problem. Models designed for this purpose are usually trained to excel on standard benchmarks following a restricted setup, excluding the use of external data. Given the recent advancements in large language and vision models, a question naturally arises: can these models directly perform well on meta-few-shot learning benchmarks? Multimodal foundation models like CLIP, which learn a joint (image, text) embedding, are of particular interest. Indeed, multimodal training has proven to enhance model robustness, especially regarding ambiguities, a limitation frequently observed in the few-shot setup. This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks, all without additional training. Our results confirm the potential and robustness of multimodal foundation models like CLIP and serve as a baseline for existing and future approaches leveraging such models.
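
To make the textual, visual, and stacked inference modes named in the TL;DR more concrete, the following is a minimal sketch on precomputed embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the toy 3-dimensional vectors stand in for real CLIP embeddings, the visual mode is assumed to be a nearest-prototype rule over the k support embeddings, and the stacked mode is assumed to average the text embedding with that prototype. This page does not specify the exact combination rule.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(query, class_vectors):
    """Return the class whose vector is most cosine-similar to the query."""
    return max(class_vectors, key=lambda c: cosine(query, class_vectors[c]))

def prototype(embeddings):
    """Unit-norm mean of the normalized support embeddings of one class."""
    return normalize(mean_vector([normalize(e) for e in embeddings]))

def textual_inference(query, text_emb):
    # text_emb: {class_name: embedding of a text prompt for that class}
    return classify(query, text_emb)

def visual_inference(query, support):
    # support: {class_name: [k support image embeddings]}
    # Assumption: nearest-prototype rule over the support set.
    return classify(query, {c: prototype(e) for c, e in support.items()})

def stacked_inference(query, text_emb, support):
    # Assumption: "stacked" is sketched here as the average of the
    # normalized text embedding and the visual prototype of each class.
    combined = {
        c: mean_vector([normalize(text_emb[c]), prototype(support[c])])
        for c in text_emb
    }
    return classify(query, combined)

# Toy 2-way 2-shot episode with made-up 3-dimensional "embeddings".
text_emb = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
support = {
    "cat": [[0.9, 0.1, 0.2], [0.8, 0.0, 0.3]],
    "dog": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
query = [0.7, 0.2, 0.3]

predictions = {
    "textual": textual_inference(query, text_emb),
    "visual": visual_inference(query, support),
    "stacked": stacked_inference(query, text_emb, support),
}
```

In practice the embeddings would come from CLIP's image and text encoders, and none of the three modes updates any model weights, which is what makes the setup training-free.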
