OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All

Yuanhuiyi Lyu, Xu Zheng, Dahun Kim, Lin Wang

TL;DR

OmniBind tackles modality mismatch and data-scale imbalance by transferring knowledge from data-rich teacher modalities (image and text) to data-scarce student modalities (touch, thermal, event, point cloud, audio) through Cross-modal Alignment Distillation (CAD). It then learns a unified representation space for any modality combination using Adaptive Fusion (AF), aided by a modality-free dataset spanning seven modalities. On the recognition task, the method improves over prior work by $+4.05\%$ on average in the arbitrary modality combination setting and also improves single-modality results (e.g., touch, $+4.34\%$). The work targets machines whose sensors can be added or removed at any time, and it provides a new modality-free dataset for training and evaluating omni-bind models.

Abstract

Research on multi-modal learning predominantly aligns the modalities in a unified space during training and uses only a single modality for prediction at inference. However, for a real machine, e.g., a robot, sensors could be added or removed at any time. Thus, it is crucial to enable the machine to handle the mismatch and unequal-scale problems of modality combinations between training and inference. In this paper, we tackle these problems from a new perspective: "Modalities Help Modalities". We present OmniBind, a novel two-stage learning framework that supports any modality combination and interaction. It teaches the data-constrained (student) modalities to align with the well-trained, data-abundant (teacher) modalities. This in turn enables the adaptive fusion of any modalities to build a unified representation space for any combination. Specifically, in stage one we propose Cross-modal Alignment Distillation (CAD) to address the unequal-scale problem between student and teacher modalities and to align the student modalities into the teacher modalities' representation space. In stage two, we propose an Adaptive Fusion (AF) module to fuse any modality combination and learn a unified representation space. To address the mismatch problem, we aggregate existing datasets and combine samples from different modalities that share the same semantics. In this way, we build the first dataset for training and evaluation that consists of teacher (image, text) and student (touch, thermal, event, point cloud, audio) modalities and enables omni-bind for any of them. Extensive experiments on the recognition task show performance gains over prior art of 4.05% on average in the arbitrary modality combination setting. OmniBind also achieves state-of-the-art performance for a single modality, e.g., touch, with a 4.34% gain.
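The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the CAD loss or the AF architecture, so the cosine-distance objective, the softmax weighting, and every name below (`alignment_loss`, `adaptive_fuse`, the per-modality `scores`) are assumptions chosen only to show the general idea of pulling a student embedding toward a frozen teacher embedding and then fusing whichever modalities happen to be present.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_loss(student_emb, teacher_emb):
    """Stage-one stand-in (assumed, not the paper's CAD loss):
    1 - cosine similarity between a student-modality embedding
    and the embedding of its paired teacher-modality sample."""
    s, t = l2_normalize(student_emb), l2_normalize(teacher_emb)
    return 1.0 - sum(a * b for a, b in zip(s, t))


def adaptive_fuse(embeddings, scores):
    """Stage-two stand-in (assumed, not the paper's AF module):
    softmax-weighted average over the modalities that are present.

    embeddings: dict mapping modality name -> embedding vector
    scores:     dict mapping modality name -> hypothetical relevance score
    """
    names = sorted(embeddings)
    if not names:
        raise ValueError("at least one modality is required")
    # Softmax over the scores of the available modalities only.
    top = max(scores[n] for n in names)
    exps = {n: math.exp(scores[n] - top) for n in names}
    total = sum(exps.values())
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for n in names:
        weight = exps[n] / total
        unit = l2_normalize(embeddings[n])
        for i in range(dim):
            fused[i] += weight * unit[i]
    return fused


# Usage: the same function handles one modality or several.
single = adaptive_fuse({"touch": [2.0, 0.0]}, {"touch": 1.3})
pair = adaptive_fuse(
    {"touch": [2.0, 0.0], "thermal": [0.0, 3.0]},
    {"touch": 0.0, "thermal": 0.0},
)
```

Because the weights are normalized over only the modalities passed in, adding or removing a sensor changes the inputs to `adaptive_fuse` but not its interface, which is the property the abstract emphasizes.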

Paper Structure (21 sections, 9 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: The proposed OmniBind, a novel two-stage learning framework that supports any modality combination and interaction. (a) Its ability to accept any combination of input modalities. (b) The performance of OmniBind. (c) The modality-free dataset.
  • Figure 2: The overall framework of OmniBind, which uses a two-stage training approach. Training stage I: aligning the student modalities via the CAD module. Training stage II: learning the unified representation space for any modality combination via the AF module.
  • Figure 3: The Adaptive Fusion module. (a) The framework of the proposed AF module. (b) The details of the classification operation in the AF module.
  • Figure 4: Overview of the modality-free dataset.
  • Figure 5: t-SNE visualization (a) without CAD and (b) with CAD. (c) Ablation study on the number of modalities.
  • ...and 3 more figures