Dynamic Multimodal Prototype Learning in Vision-Language Models

Xingyu Zhu, Shuo Wang, Beier Zhu, Miaoge Li, Yunfan Li, Junfeng Fang, Zhicai Wang, Dongsheng Wang, Hanwang Zhang

TL;DR

Ambiguities in class names hinder textual prototypes in vision-language models, motivating multimodal prototype learning. ProtoMM is a training-free framework that constructs multimodal prototypes from textual descriptions and dynamically updated visual particles, using optimal transport to fuse evidence from test streams. It demonstrates improvements across 15 zero-shot benchmarks, including ImageNet variants, without gradient-based tuning. The method leverages a visual cache and Sinkhorn OT to progressively incorporate visual knowledge, enabling robust generalization to unseen data in a streaming test-time setting.

Abstract

With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs at test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM can combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.
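To make the optimal transport formulation concrete, the following is a minimal sketch of how a semantic distance between a test image and a prototype could be computed with Sinkhorn iterations. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine_cost`, `sinkhorn`, `ot_distance`), the uniform marginal weights, the cosine cost, the regularization strength `eps`, and the toy 2-D features are all hypothetical. Each prototype is treated as a set of feature vectors (textual descriptions plus visual particles), and the test image as a set of augmented-view features.

```python
import math

def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity between xs[i] and ys[j]."""
    def unit(v):
        n = math.sqrt(sum(a * a for a in v))
        return [a / n for a in v]
    xs = [unit(x) for x in xs]
    ys = [unit(y) for y in ys]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT: returns a plan whose row sums approach a and column sums match b."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    n, m = len(a), len(b)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_distance(views, prototype, eps=0.1):
    """OT distance between image views and prototype elements.

    Assumption: uniform weights on both sides, chosen only for simplicity.
    """
    cost = cosine_cost(views, prototype)
    a = [1.0 / len(views)] * len(views)
    b = [1.0 / len(prototype)] * len(prototype)
    plan = sinkhorn(cost, a, b, eps)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(a)) for j in range(len(b)))

# Toy features (hypothetical): two augmented views and two candidate prototypes.
views = [[1.0, 0.1], [0.9, 0.2]]
proto_a = [[1.0, 0.0], [0.8, 0.3]]   # elements close to the views
proto_b = [[0.0, 1.0], [0.1, 0.9]]   # elements far from the views

# Predict the class whose prototype has the smallest OT distance.
distances = {"a": ot_distance(views, proto_a), "b": ot_distance(views, proto_b)}
pred = min(distances, key=distances.get)
```

In this sketch the transport plan also indicates which prototype elements each view is matched to, which is the kind of signal a method could use to decide how visual particles are refreshed as the test stream proceeds; the update rule itself is not specified here.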

Paper Structure

This paper contains 14 sections, 10 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Observations of ambiguities in class names from the Oxford Flowers and ImageNet datasets: (a) The class names "sword lily" and "blackberry lily" both refer to flowers from the Iridaceae family and share the word "lily". (b) The class names "laptop" and "desktop computer" denote different types of computers.
  • Figure 2: Illustration of multimodal prototype learning on the ImageNet dataset. (a) Process of updating multimodal prototypes. (b) Comparisons of distribution metrics.
  • Figure 3: The framework of the proposed method (ProtoMM), which consists of two modules: (a) Distributed Feature Construction, which expands the textual prototypes with visual features from testing samples; (b) Multimodal Prototype Learning, which updates the multimodal prototypes through the transport plan for the next prediction.
  • Figure 4: Analysis of classification performance by varying the number of augmentations on the ImageNet dataset: (a) image augmentation, (b) class name augmentation.
  • Figure 5: Analysis of classification performance by varying the threshold value on the ImageNet dataset.
  • ...and 2 more figures