FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, Zhou Zhao

TL;DR

FreeBind addresses the challenge of improving a pre-trained unified multimodal space without retraining billion-parameter models or risking catastrophic forgetting. It introduces two basic bonds, space displacement and space combination, to fuse expert spaces into a frozen unified space, and augments these with Complex Sequential & Parallel Bonds and a coarse-to-fine inference strategy. The method uses pseudo datasets and a simple InfoNCE-based single-projector training, plus a mixture-of-projectors, to align and combine modalities, demonstrating state-of-the-art performance on audio-image-text tasks and even surpassing some specialized expert spaces under customized inference. The approach is computationally efficient, scalable to multiple experts, and offers flexible, task-specific customization with practical implications for rapid development of stronger, unified multimodal representations. Schematically: $A^{u}V^{u}T^{u} + d(V^{vt}T^{vt})$ → … $+\, c(A^{at}T^{at})$ → customized fused space. Results on ImageBind show substantial improvements across retrieval and classification tasks with modest training cost.
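The InfoNCE objective mentioned above can be illustrated with a minimal sketch. This is a generic, one-directional InfoNCE loss over paired embeddings, not the paper's training code: the function names `l2_normalize` and `info_nce`, the temperature value, and the plain-list representation are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss; queries[i] and keys[i] are treated as the positive pair.

    Hypothetical helper for illustration: every other key in the batch
    acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every key.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching key as the target class.
        total += log_denominator - logits[i]
    return total / len(q)


aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, shuffled)
```

In this toy batch, correctly paired embeddings give a loss close to zero, while mismatched pairs are penalized heavily; that contrast is the signal a projector would be trained on.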

Abstract

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the problem of catastrophic forgetting make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from this modular design, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces.

Paper Structure (39 sections, 13 equations, 6 figures, 9 tables)

This paper contains 39 sections, 13 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: High-level overview of FreeBind. We propose two basic kinds of space bonds: space displacement bond and space combination bond, to efficiently augment unified space by integrating knowledge of extra expert spaces.
  • Figure 2: The pipeline of basic space displacement bond and space combination bond.
  • Figure 3: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on InternVL$_{I\!B}^\dagger$++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between InternVL$_{I\!B}^\dagger$++ and InternVL$_{I\!B}^\dagger$ on audio-text, audio-image and image-text retrieval tasks, respectively. A positive $\Delta_{*}$ signifies an improvement on the corresponding task, while a negative value indicates a reduction. The gray plane in 3D panel (a) denotes the audio-text performance of CLAP$_{g}$.
  • Figure 4: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on InternVL$_{I\!B}$++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between InternVL$_{I\!B}$++ and InternVL$_{I\!B}$ on audio-text, audio-image and image-text retrieval tasks, respectively. The gray plane in 3D panel (a) denotes the audio-text performance of CLAP$_{g}$.
  • Figure 5: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on ImageBind++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between ImageBind++ and ImageBind on audio-text, audio-image and image-text retrieval tasks, respectively. The gray plane in 3D panel (a) denotes the audio-text performance of CLAP$_{g}$.
  • ...and 1 more figures
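
The combining factors ($\sigma_a, \sigma_t$) analyzed in the figures above can be pictured with a small sketch. The exact combination rule is not given in this excerpt, so the code below assumes a simple convex mix of a unified-space embedding and an expert embedding already projected into that space, followed by renormalization. The function names `l2_normalize` and `combine`, and the blending rule itself, are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def combine(unified, expert_projected, sigma):
    """Blend two embeddings with a combining factor sigma in [0, 1].

    Assumed rule (not taken from the paper): sigma = 0 keeps the unified
    embedding, sigma = 1 keeps the projected expert embedding, and values
    in between interpolate before renormalizing.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    u = l2_normalize(unified)
    e = l2_normalize(expert_projected)
    mixed = [(1.0 - sigma) * a + sigma * b for a, b in zip(u, e)]
    return l2_normalize(mixed)


audio_unified = [1.0, 0.0]
audio_expert = [0.0, 2.0]
only_unified = combine(audio_unified, audio_expert, 0.0)
only_expert = combine(audio_unified, audio_expert, 1.0)
balanced = combine(audio_unified, audio_expert, 0.5)
```

Under this assumed rule, separate factors per modality (one for audio, one for text) would let the same fused space lean toward the expert for one task and toward the unified space for another, which is the kind of trade-off the $\Delta$ plots explore.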