FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang; Ziang Zhang; Xize Cheng; Rongjie Huang; Luping Liu; Zhenhui Ye; Haifeng Huang; Yang Zhao; Tao Jin; Peng Gao; Zhou Zhao

TL;DR

FreeBind improves a pre-trained unified multimodal space without retraining billion-parameter models or risking catastrophic forgetting. It introduces two basic bonds, space displacement and space combination, that fuse expert spaces into a frozen unified space, and it extends them with Complex Sequential & Parallel Bonds and a coarse-to-fine customized inference strategy. Alignment relies on pseudo datasets, a simple InfoNCE-based single-projector training objective, and a mixture-of-projectors strategy. The resulting spaces reach state-of-the-art performance on audio-image-text tasks and, under customized inference, even surpass some specialized expert spaces. The approach is computationally efficient, scales to multiple experts, and allows flexible, task-specific customization. Schematically, with $d(\cdot)$ and $c(\cdot)$ denoting the displacement and combination bonds: $A^{u}V^{u}T^{u} + d(V^{vt}T^{vt}) \to \dots + c(A^{at}T^{at}) \to$ customized fused space. Applied to ImageBind, this yields substantial improvements across retrieval and classification tasks at modest training cost.
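The InfoNCE objective mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the symmetric averaging, and the temperature value of 0.07 are assumptions made for illustration, and the embeddings are plain Python lists rather than projector outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair.

    Every other key in the batch serves as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE losses (an assumed convention)."""
    return 0.5 * (info_nce(emb_a, emb_b, temperature)
                  + info_nce(emb_b, emb_a, temperature))
```

In this sketch the loss is close to zero when each embedding is most similar to its own pair and grows when pairs are mismatched, which is the signal a projector would be trained to minimize.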

Abstract

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the problem of catastrophic forgetting make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. These resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces.
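As a rough illustration of how a combination bond and customized inference might interact, the sketch below mixes a unified-space embedding with an expert embedding that is assumed to have already been projected into the same space. The weighted-average form, the function name `combine`, and the parameter `sigma` are assumptions; the figure captions refer to combining factors ($\sigma_a, \sigma_t$), but this page does not give the exact formula.

```python
import math


def combine(unified_emb, aligned_expert_emb, sigma):
    """Mix two embeddings of the same input and renormalize.

    sigma = 0 keeps only the unified-space embedding, sigma = 1 keeps only
    the (already aligned) expert embedding. Assumed form, not the paper's.
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    mixed = [sigma * e + (1.0 - sigma) * u
             for u, e in zip(unified_emb, aligned_expert_emb)]
    norm = math.sqrt(sum(x * x for x in mixed))
    if norm == 0.0:
        raise ValueError("embeddings cancel out; cannot normalize")
    return [x / norm for x in mixed]


# Hypothetical per-modality factors, e.g. one for audio and one for text.
audio_fused = combine([3.0, 4.0], [1.0, 0.0], sigma=0.5)
text_fused = combine([0.0, 1.0], [1.0, 1.0], sigma=0.2)
```

Under this assumed form, adjusting the factor per modality at inference time shifts the fused space toward either the unified space or the expert, with no further training.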

Paper Structure (39 sections, 13 equations, 6 figures, 9 tables)

Introduction
Related work
Multimodal Representation Space
Knowledge Fusion in Multimodal Representation
Method
Problem formulation
Basic Space Bonds
Pseudo Datasets Collection
Space Alignments
Single Projector Training
Mixture-of-Projectors Strategy
Inference
Complex Sequential & Parallel Bonds
Coarse-to-Fine Customized Inference
Experiment and Discussions
...and 24 more sections

Figures (6)

Figure 1: High-level overview of FreeBind. We propose two basic kinds of space bonds: space displacement bond and space combination bond, to efficiently augment unified space by integrating knowledge of extra expert spaces.
Figure 2: The pipeline of basic space displacement bond and space combination bond.
Figure 3: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on InternVL$_{I\!B}^\dagger$++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between InternVL$_{I\!B}^\dagger$++ and InternVL$_{I\!B}^\dagger$ on audio-text, audio-image and image-text retrieval tasks, respectively. Positive $\Delta_{*}$ signifies improvements in the corresponding task, while negative values indicate reductions. The gray plane in the 3D figure $a)$ denotes the audio-text performance of CLAP$_{g}$.
Figure 4: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on InternVL$_{I\!B}$++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between InternVL$_{I\!B}$++ and InternVL$_{I\!B}$ on audio-text, audio-image and image-text retrieval tasks, respectively. The gray plane in the 3D figure $a)$ denotes the audio-text performance of CLAP$_{g}$.
Figure 5: Analysis of CLAPs' combining factors ($\sigma_a, \sigma_t$) on ImageBind++. $\Delta_{AT}, \Delta_{A\!V}, \Delta_{TV}$ represent the average R@1 difference between ImageBind++ and ImageBind on audio-text, audio-image and image-text retrieval tasks, respectively. The gray plane in the 3D figure $a)$ denotes the audio-text performance of CLAP$_{g}$.
...and 1 more figure
