OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

TL;DR

OmniBind builds scalable omni multimodal representations by binding a diverse set of pre-trained spaces onto a shared backbone (EVA-CLIP-18B) and using per-modality routers to weight their contributions dynamically. It scales from 7B to 30B parameters by reusing unpaired unimodal data and 14 spaces, achieving state-of-the-art results across 13 benchmarks and enabling applications such as 3D–audio retrieval and any-query localization. Two learning objectives, cross-modal overall alignment and language representation decoupling, guide the routers to maximize cross-modal coherence while preserving language discrimination. Because binding and routing require only lightweight networks, training stays efficient. The work shows that scaling through space binding is a viable path to versatile, high-performance omni representations with practical compute and data requirements, and it opens avenues for broad multimodal understanding and generation tasks.

Abstract

Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both only require lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single 8-4090 node. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.
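
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear router, the softmax over spaces, the weighted sum, and all names (`route_and_combine`, `router_w`, `router_b`) are assumptions made for illustration. The abstract only states that a learned router dynamically assigns weights to the different spaces; it does not give the router architecture or how the weighted spaces are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_and_combine(space_embs, router_w, router_b):
    """Combine embeddings from K bound spaces with per-item router weights.

    space_embs: (K, N, D) embeddings of N items from K spaces, assumed to be
                already remapped into one shared D-dimensional space.
    router_w:   (K * D, K) weights of a hypothetical linear router.
    router_b:   (K,) bias of the hypothetical router.
    Returns the unit-normalized combined embeddings (N, D) and weights (N, K).
    """
    K, N, D = space_embs.shape
    # The router sees all K embeddings of an item, concatenated (an assumption).
    feats = space_embs.transpose(1, 0, 2).reshape(N, K * D)
    weights = softmax(feats @ router_w + router_b, axis=-1)  # (N, K)
    # Weighted sum over spaces, then L2 normalization for cosine retrieval.
    combined = np.einsum("nk,knd->nd", weights, space_embs)
    combined = combined / np.linalg.norm(combined, axis=-1, keepdims=True)
    return combined, weights

rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 5, 8))       # 3 spaces, 5 items, 8 dimensions
w = rng.normal(size=(3 * 8, 3)) * 0.1   # untrained router parameters
b = np.zeros(3)
combined, weights = route_and_combine(embs, w, b)
print(combined.shape, weights.sum(axis=1))
```

In this sketch only `router_w` and `router_b` would be trained while the space encoders stay frozen, which is one way to read the abstract's point that routing needs only lightweight networks.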

Paper Structure (19 sections, 7 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of OmniBind. OmniBind integrates diverse knowledge of various existing multimodal models, leading to large-scale omni representations. OmniBind exhibits remarkable versatility and achieves state-of-the-art results on extensive downstream tasks over all modality pairs.
  • Figure 2: The pipeline of OmniBind. The $\Theta_{X}$ denotes the router of modality $X$, and $E_{X}^i$ is the $i$-th encoder of modality $X$. The losses $L_{align}$ and $L_{dec}$ are the objectives for training the routers.
  • Figure 3: Qualitative comparison of audio to 3D object retrieval. More visualizations of 3D-audio retrieval are provided in the Appendix.
  • Figure 4: Diverse applications enabled by OmniBind.
  • Figure 5: More visualization of 3D-to-Audio retrieval.
  • ...and 1 more figure