Context-Aware Multimodal Pretraining

Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff

TL;DR

Vision-language models can be trained to exhibit significantly increased few-shot adaptation. Equipped with simple, training-free, metric-based adaptation mechanisms, the resulting representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple but carefully designed extension to multimodal pretraining that enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.
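
To make "training-free, metric-based adaptation" concrete, the following is a minimal sketch of a prototypical (nearest-class-mean) classifier applied to frozen embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `class_prototypes`, `predict`) and the toy 2-D embeddings are hypothetical, and a real setup would use image embeddings produced by the pretrained vision encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def class_prototypes(support_embeddings, support_labels):
    """Average the normalized few-shot embeddings of each class.

    No gradient steps are involved: "adaptation" is just computing one
    mean vector (prototype) per class from the labeled support examples.
    """
    sums, counts = {}, {}
    for emb, label in zip(support_embeddings, support_labels):
        emb = l2_normalize(emb)
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], emb)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in sums[label]])
        for label in sums
    }


def predict(query_embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = l2_normalize(query_embedding)
    return max(
        prototypes,
        key=lambda label: sum(q * p for q, p in zip(query, prototypes[label])),
    )


# Toy 2-shot, 2-class example with hypothetical 2-D embeddings.
support = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
labels = ["cat", "cat", "dog", "dog"]
prototypes = class_prototypes(support, labels)
print(predict([0.8, 0.2], prototypes))
```

Because the classifier only averages and compares embeddings, its accuracy depends entirely on how well the pretrained representation clusters unseen classes, which is the property the proposed pretraining objective targets.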

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Context-aware multimodal pretraining facilitates few-shot transfer. Applying Tip-Adapter [zhang2022tipadapter] on a ViT-S/16 pretrained with and without our contextualized pretraining objective (here modifying SigLIP [zhai2023siglip]) showcases increases in test-time sample efficiency and overall few-shot performance while maintaining the underlying zero-shot transfer performance.
  • Figure 2: Dataset-level performance breakdown (32-shot, Tip-Adapter [zhang2022tipadapter]) for ViT-S/16 shows gains on all 21 benchmarks, reaching up to $+16.2\%$, with each dataset improving by at least $+1.0\%$.
  • Figure 3: Significant gains across metric-based few-shot classifiers. Applying prototypical classification [snell2017protonet], Tip-Adapter [zhang2022tipadapter], and nearest-neighbor classifiers [nakata2022knn, geirhos2024flexibleperceptionvisualmemory] to vision backbones trained with context-aware pretraining significantly boosts $32$-shot results across the board (here ViT-S/16, 1.5B examples).
  • Figure 4: Context-aware post-training. We apply SigLIxP to an already pretrained ViT-S/16 (1.5B examples), finetuning for +0.5B and +1B examples, and contrast the performance against a ViT-S/16 pretrained on 6B examples. Results indicate that context-aware finetuning can match much longer base pretraining with only +0.5B examples and noticeably outperform it with just +1B examples, even though the base zero-shot transfer performance of the 6B reference model is much higher. Visualized results use Tip-Adapter for classification.
  • Figure 5: Contextualized pretraining particularly benefits many-shot transfer. For all 21 evaluation benchmarks, we plot the absolute number of examples (shots/class $\times$ $\#$classes) against the relative gain when switching to SigLIxP. Results shown are for 32 shots/class. We find consistent relative gains in all scenarios, which grow as the absolute example count increases.
  • ...and 1 more figure
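
Several of the captions above report results obtained with Tip-Adapter in its training-free form. As a rough illustration of how such a cache-based classifier combines zero-shot and few-shot evidence, here is a minimal sketch. It assumes the commonly described formulation (zero-shot similarity plus `alpha * exp(-beta * (1 - similarity))` accumulated per cached label), omits the logit temperature used in practice, and uses placeholder values for `alpha` and `beta`. The function name `tip_adapter_logits` and the toy embeddings are hypothetical and not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(query, cache_keys, cache_labels, text_weights,
                       alpha=1.0, beta=5.5):
    """Training-free cache classifier in the style of Tip-Adapter.

    query:        image embedding of the test example
    cache_keys:   image embeddings of the few-shot examples
    cache_labels: integer class index for each cached example
    text_weights: one text embedding per class (the zero-shot classifier)
    alpha, beta:  placeholder hyperparameters (assumed, not tuned values)
    """
    q = l2_normalize(query)
    # Zero-shot term: cosine similarity to each class text embedding.
    logits = [dot(q, l2_normalize(w)) for w in text_weights]
    # Few-shot term: each cached example votes for its own class, weighted
    # by how close it is to the query in embedding space.
    for key, label in zip(cache_keys, cache_labels):
        affinity = math.exp(-beta * (1.0 - dot(q, l2_normalize(key))))
        logits[label] += alpha * affinity
    return logits


# Toy example with hypothetical 2-D embeddings and two classes.
text_weights = [[1.0, 0.0], [0.0, 1.0]]
query = [0.6, 0.5]

zero_shot = tip_adapter_logits(query, [], [], text_weights)
adapted = tip_adapter_logits(query, [[0.6, 0.5]], [1], text_weights)
print(zero_shot.index(max(zero_shot)), adapted.index(max(adapted)))
```

In this toy case the zero-shot term alone favors class 0, while a single cached example of class 1 lying close to the query shifts the prediction to class 1. With an empty cache the function reduces to the zero-shot classifier, which matches the captions' point that few-shot gains come on top of unchanged zero-shot behavior.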