Vision Learners Meet Web Image-Text Pairs
Bingchen Zhao, Quan Cui, Hao Wu, Osamu Yoshie, Cheng Yang, Oisin Mac Aodha
TL;DR
The paper investigates self-supervised learning on large-scale, noisy web image-text data. It finds that generative pre-training outperforms discriminative pre-training and that existing multi-modal discriminative approaches do not surpass single-modal methods. It introduces MUG, a multi-modal generative pre-training framework that learns from image-text pairs by jointly reconstructing images and generating captions, optimizing a combined loss to maximize the joint information $I(X^V,X^L;Z)$. The authors provide an information-theoretic rationale for why generative and multi-modal objectives can yield more transferable representations and demonstrate state-of-the-art transfer across ImageNet-1K, ADE20K, and other benchmarks, with favorable scaling as pre-training data increases. The work highlights the value of modeling the joint distribution of vision and language in a purely generative, multi-modal setting and offers insights for designing scalable, robust vision learners with web data.
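To make the idea of a combined objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the image term is a mean squared error over masked patches and the caption term is a per-token cross-entropy, summed with a scalar weight. The function names, the `caption_weight` parameter, and the list-based data layout are all hypothetical and chosen only for illustration.

```python
import math


def reconstruction_loss(pred_patches, target_patches, mask):
    """Mean squared error averaged over masked patches only.

    mask[i] == 1 means patch i was hidden from the encoder (assumed convention).
    """
    total, count = 0.0, 0
    for pred, target, hidden in zip(pred_patches, target_patches, mask):
        if hidden:
            total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
            count += 1
    return total / count if count else 0.0


def caption_loss(token_logits, target_ids):
    """Average cross-entropy of the target caption tokens."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def combined_loss(pred_patches, target_patches, mask,
                  token_logits, target_ids, caption_weight=1.0):
    """Image reconstruction term plus a weighted caption generation term.

    caption_weight is a hypothetical hyperparameter for this sketch.
    """
    image_term = reconstruction_loss(pred_patches, target_patches, mask)
    text_term = caption_loss(token_logits, target_ids)
    return image_term + caption_weight * text_term


# Toy inputs: two 2-value patches (only the second is masked) and a
# two-token caption scored against a 4-word vocabulary with uniform logits.
pred_patches = [[0.0, 0.0], [1.0, 1.0]]
target_patches = [[0.0, 0.0], [0.0, 0.0]]
mask = [0, 1]
token_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
target_ids = [2, 0]

loss = combined_loss(pred_patches, target_patches, mask,
                     token_logits, target_ids, caption_weight=0.5)
```

In this toy setting the image term depends only on the masked patch, and the caption term equals the log of the vocabulary size because the logits are uniform. In practice both terms would be computed from a shared latent representation $Z$, which is what ties the two generative targets together.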
Abstract
Many self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.
