Vision Learners Meet Web Image-Text Pairs

Bingchen Zhao; Quan Cui; Hao Wu; Osamu Yoshie; Cheng Yang; Oisin Mac Aodha

TL;DR

The paper investigates self-supervised pre-training on large-scale, noisy web image-text data. It finds that generative pre-training outperforms discriminative pre-training and that existing multi-modal discriminative approaches do not surpass single-modal methods on vision transfer tasks. It introduces MUG, a multi-modal generative pre-training framework that learns from image-text pairs by jointly reconstructing images and generating captions, optimizing a combined loss to maximize the joint information $I(X^V,X^L;Z)$. The authors give an information-theoretic rationale for why generative and multi-modal objectives can yield more transferable representations, and they report state-of-the-art transfer on ImageNet-1K, ADE20K, and other benchmarks, with favorable scaling as the amount of pre-training data grows. The work highlights the value of modeling the joint distribution of vision and language in a purely generative, multi-modal setting and offers insights for designing scalable, robust vision learners with web data.
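The combined objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a mean-squared-error reconstruction term on masked patches (in the style of masked image modeling) and a token-level cross-entropy term for the caption, and the function names and the `caption_weight` parameter are hypothetical choices made for this example.

```python
import numpy as np

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """Mean squared error averaged over masked patches only (assumed form).

    pred_patches, target_patches: arrays of shape (num_patches, patch_dim)
    mask: array of shape (num_patches,), 1 for masked patches, 0 otherwise
    """
    per_patch = ((pred_patches - target_patches) ** 2).mean(axis=1)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))

def caption_loss(logits, token_ids):
    """Average cross-entropy of the caption tokens (assumed form).

    logits: array of shape (seq_len, vocab_size)
    token_ids: integer array of shape (seq_len,)
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(token_ids)), token_ids].mean())

def combined_loss(pred_patches, target_patches, mask, logits, token_ids,
                  caption_weight=1.0):
    """Image reconstruction term plus a weighted caption generation term.

    caption_weight is a hypothetical balancing coefficient for this sketch.
    """
    image_term = masked_reconstruction_loss(pred_patches, target_patches, mask)
    text_term = caption_loss(logits, token_ids)
    return image_term + caption_weight * text_term
```

In this sketch both terms are computed from outputs that would be decoded from the same image representation, so minimizing the sum pushes that representation to retain information about both the image and its paired text.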

Abstract

Many self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.

Paper Structure (26 sections, 8 equations, 7 figures, 16 tables, 1 algorithm)

Introduction
Related Work
Vision Learners with Contrastive Learning
Vision Learners with Masked Image Modeling
Vision Learners with Web Image-Text Pairs
Preliminaries
Benchmarking SSL Methods on Web Data
Key Observations
Approach
Motivation
MUG: MUlti-modal Generator
Understanding the Loss Function
Experiments
Implementation Details
Transfer Learning
...and 11 more sections

Figures (7)

Figure 1: Comparison of different vision pre-training paradigms that use images or image-text pairs. Four paradigms are considered: (i) single-modal discriminative (e.g., SimCLR), (ii) single-modal generative (e.g., MAE), (iii) multi-modal discriminative (e.g., CLIP), and (iv) our proposed multi-modal generative approach named MUG. The multi-modal generative paradigm simultaneously generates images and text using only image representations.
Figure 2: Left: Single/multi-modal discriminative methods have a narrow bottleneck and thus learn a less informative representation. Middle: Single-modal generative methods have a wide bottleneck and thus learn a more informative representation. Right: Multi-modal generative methods have a wider bottleneck for generating (e.g., recovering) the joint distribution of both modalities, and as a result learn an even more informative representation.
Figure 3: Illustration of our multi-modal self-supervised approach MUG.
Figure 4: Reconstructions of masked images and captions produced by our MUG approach on the MS-COCO (top) and PASCAL-VOC (bottom) datasets.
Figure 5: Uncurated random samples of COCO images. For each triplet, we show the masked image (left), the MUG reconstruction (middle), the ground truth (right), and the captions generated by MUG. The masking ratio is 75%.
...and 2 more figures
