Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan

TL;DR

This work proposes a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities, and achieves SOTA results with an embarrassingly simple linear classifier for vision-language adaptation.

Abstract

The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks use few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP learn cross-modal encoders that map different modalities to the same representation space. Specifically, we propose a simple strategy for cross-modal adaptation: we treat examples from different modalities as additional few-shot examples. For example, by simply repurposing class names as an additional training sample, we trivially turn any n-shot learning problem into an (n+1)-shot problem. This allows us to produce SOTA results with embarrassingly simple linear classifiers. We show that our approach can be combined with existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
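The core idea in the abstract, treating one text feature per class as an extra training sample for a linear classifier, can be illustrated with a minimal sketch. This is not the authors' implementation: the features below are synthetic stand-ins for pre-extracted embeddings from a shared image-text space, and the function names, temperature, learning rate, and plain gradient-descent training loop are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each feature vector to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_cross_modal_set(image_feats, image_labels, text_feats):
    """Append one text feature per class to the n-shot image features.

    text_feats[c] is assumed to be the embedding of class c's name, so an
    n-shot problem becomes an (n+1)-shot problem.
    """
    num_classes = text_feats.shape[0]
    feats = np.concatenate([image_feats, text_feats], axis=0)
    labels = np.concatenate([image_labels, np.arange(num_classes)], axis=0)
    return l2_normalize(feats), labels


def train_linear_classifier(feats, labels, num_classes,
                            lr=0.5, steps=500, temperature=0.1):
    """Softmax regression trained with full-batch gradient descent.

    Hyperparameters here are illustrative assumptions, not values from the paper.
    """
    n, d = feats.shape
    W = np.zeros((num_classes, d))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ W.T / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to W
        grad = (probs - onehot).T @ feats / (n * temperature)
        W -= lr * grad
    return W


def predict(W, feats):
    """Return the highest-scoring class for each (normalized) feature."""
    return np.argmax(l2_normalize(feats) @ W.T, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_classes, n_shot, dim = 3, 2, 32
    # Synthetic class prototypes standing in for a shared embedding space
    prototypes = rng.normal(size=(num_classes, dim))
    image_labels = np.repeat(np.arange(num_classes), n_shot)
    image_feats = prototypes[image_labels] + 0.5 * rng.normal(size=(num_classes * n_shot, dim))
    text_feats = prototypes + 0.1 * rng.normal(size=(num_classes, dim))

    feats, labels = build_cross_modal_set(image_feats, image_labels, text_feats)
    W = train_linear_classifier(feats, labels, num_classes)
    print("training samples:", feats.shape[0], "predictions:", predict(W, feats))
```

The only cross-modal step is `build_cross_modal_set`; everything after it is an ordinary linear classifier, which is why the approach can in principle be layered on top of other adaptation methods.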

Paper Structure (9 sections, 8 equations, 6 figures, 18 tables, 1 algorithm)

Figures (6)

  • Figure 1: Human perception is cross-modal. Our work is loosely inspired by neuroscience studies suggesting that neurons can be triggered by stimuli from different modalities, such as vision, audio, or even language [gibson1969principles, meltzoff1979intermodal, Nanay2018-NANMMI]. In this work, we propose to leverage such cross-modal representations to adapt multimodal models (such as CLIP [radford2021learning] and AudioCLIP [guzhov2021audioclip]) for few-shot learning with a simple but effective strategy: we learn (non)linear classifiers built on top of few-shot examples that span different modalities, including vision, audio, and language (Fig. 2).
  • Figure 2: Adding additional modalities helps few-shot learning. Adding textual labels to a 2-shot cat-vs-dog classification task leads to better test performance (by turning the problem into a 3-shot cross-modal task!). We visualize cross-modal CLIP [gao2021clip] features (projected to 2D with principal component analysis) and the resulting classifier learned from them, and observe a large shift in the decision boundary. See Fig. 5 for more examples.
  • Figure 3: Cross-modality reduces the ambiguity of few-shot learning. Classic (uni-modal) few-shot learning is often underspecified. Even for binary classification, when given only a single image per class (left), it is unclear whether the target class is the animal, the hat, or the background scene. Adding an extra modality, such as text or audio, helps clarify the problem setup (right). Notably, language usually comes "for free" in classification datasets in the form of a textual label per class.
  • Figure 4: Uni-modal (left) vs. cross-modal adaptation (right) for a binary cat-vs-dog classification task. Prior work [zhou2022coop, zhang2021tip, gao2021clip, wortsman2022robust] optimizes over a loss from a single modality. Cross-modal adaptation makes use of additional training samples from other modalities, exploiting pre-trained encoders that map different modalities to the same representation space. We show that cross-modal learning can also improve prior art and even extend to audio modalities with AudioCLIP [guzhov2021audioclip].
  • Figure 5: Additional PCA projection plots for random pairs of classes in ImageNet [deng2009imagenet]. Adding one-shot text as a training sample can often shift the decision boundary substantially.
  • ...and 1 more figure
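The captions for Figures 2 and 5 describe projecting features to 2D with principal component analysis before plotting. A minimal sketch of such a projection is shown below, assuming the features are already extracted into a NumPy array; the function name `pca_project_2d` is hypothetical and the captions do not specify which PCA implementation the authors used.

```python
import numpy as np


def pca_project_2d(feats):
    """Project (n_samples, dim) features onto their top two principal components."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10, 8))  # synthetic stand-in for image and text features
    print(pca_project_2d(feats).shape)
```

Image and text features would be stacked into one array before the call so that both modalities share the same 2D axes.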