Explore the Limits of Omni-modal Pretraining at Scale

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue

TL;DR

This work introduces MiCo (Multimodal Context), a scalable omni-modal pretraining framework that learns universal representations across diverse modalities. A two-branch architecture aligns knowledge modalities with language, which serves as the natural interface. MiCo builds a unified multimodal context using shared position embeddings and modality-specific cues, and optimizes three complementary objectives, each with an explicit loss formulation: omni-modal contrastive learning, omni-modal feature matching, and omni-modal caption generation. Empirically, MiCo sets 37 new state-of-the-art records across single-modality perception benchmarks covering 10 modalities, 25 cross-modal tasks, and 18 multimodal LLM benchmarks, including zero-shot reasoning when paired with LLMs. The results suggest that jointly scaling modalities, data, and model parameters within the MiCo paradigm substantially improves omni-modal understanding and transfer, with practical implications for building versatile foundation models across vision, language, audio, and 3D modalities.
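To make the first of the three objectives concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style contrastive loss between pooled multimodal-context embeddings and text embeddings. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `symmetric_info_nce`, the temperature value of 0.07, and the choice of a symmetric cross-entropy are all hypothetical.

```python
import numpy as np

def symmetric_info_nce(context_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for paired embeddings (illustrative sketch).

    context_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to be a positive pair, all other rows are negatives.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = c @ t.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row; the target
        # class for row i is column i (the matching pair).
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the context-to-text and text-to-context directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_aligned = symmetric_info_nce(emb, emb)           # correct pairing
loss_mismatched = symmetric_info_nce(emb, emb[::-1])  # every pair is wrong
```

In this toy setup, correctly paired embeddings should yield a much lower loss than mismatched ones, which is the behaviour a contrastive objective rewards.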

Abstract

We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks of 10 different modalities, ii) 25 cross-modality understanding tasks covering retrieval, question answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new records for state-of-the-art performance. We hope that our research could contribute to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo

Paper Structure (21 sections, 6 equations, 6 figures, 8 tables, 2 algorithms)

Figures (6)

  • Figure 1: Omni-modal Pretraining. We propose collecting large-scale omni-modal paired data, including text, image, video, depth, and normal maps, to learn universal representations.
  • Figure 2: Multimedia Cognition Process in Brain Inspires our Design. We split diverse modalities into two types and employ individual neural networks to learn representations from each type respectively.
  • Figure 3: Evolution of Pretraining Paradigms. Masked modeling (He et al., 2022; Huang et al., 2022; Devlin et al., 2018) has shown great success in single-modality general-purpose understanding. Contrastive learning (He et al., 2020; Radford et al., 2021; Chen et al., 2020) distinguishes transferable features with modality tuples. We aim to achieve general-purpose omni-modal understanding and learn transferable, universal representations.
  • Figure 4: Options of Architecture Design for Omni-Modal Pretraining.
  • Figure 5: Overview of Multimodal Context Pretraining Paradigm. We use a shared ViT for multimodal feature extraction, while a second branch employs a text encoder. We concatenate these multimodal sequences into multimodal contexts and perform contrastive learning and masked modeling.
  • ...and 1 more figure
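
The context construction described in the Figure 5 caption (per-modality token sequences concatenated into one multimodal context) can be sketched roughly as follows. This is a simplified illustration, assuming that each modality's tokens receive the same shared position embeddings plus a learned per-modality embedding before concatenation. The function `build_multimodal_context`, the dictionary layout, and the additive combination are hypothetical choices and may differ from the released code.

```python
import numpy as np

def build_multimodal_context(tokens_by_modality, pos_emb, modality_emb):
    """Concatenate per-modality token sequences into one context (sketch).

    tokens_by_modality: dict mapping modality name -> array (num_tokens, dim),
        e.g. the outputs of a shared ViT for each modality (assumed).
    pos_emb: array (max_tokens, dim), shared by every modality.
    modality_emb: dict mapping modality name -> array (dim,), the
        modality-specific cue added to all tokens of that modality.
    """
    parts = []
    for name, tokens in tokens_by_modality.items():
        n = tokens.shape[0]
        # Same position table for every modality, plus a per-modality offset.
        parts.append(tokens + pos_emb[:n] + modality_emb[name])
    # One long sequence: all tokens of modality 1, then modality 2, and so on.
    return np.concatenate(parts, axis=0)


dim = 8
rng = np.random.default_rng(0)
tokens = {
    "image": rng.normal(size=(4, dim)),
    "audio": rng.normal(size=(3, dim)),
}
pos_emb = rng.normal(size=(16, dim))
modality_emb = {"image": np.ones(dim), "audio": -np.ones(dim)}

context = build_multimodal_context(tokens, pos_emb, modality_emb)
```

Sharing the position table means the first token of every modality receives the same positional signal, so the modality-specific embedding is what lets the model tell the sources apart within the concatenated sequence.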