ImageBindDC: Compressing Multi-modal Data with ImageBind-based Condensation

Yue Min; Shaobo Wang; Jiaze Li; Tianle Niu; Junxin Fan; Yongliang Miao; Lijin Yang; Linfeng Zhang

TL;DR

ImageBindDC is a multimodal data condensation framework that operates in the unified ImageBind embedding space and uses a Characteristic Function (CF) distance loss to align the moments of real and synthetic data exactly. By enforcing uni-modal, cross-modal, and joint-modal distribution consistency, it preserves intricate inter-modal relationships that prior uni-modal condensation methods fail to capture. Empirical results on VGGS-10K, AVE, NYU-v2, and Clotho demonstrate state-of-the-art performance with significantly reduced condensation time and data requirements, including near-full-dataset performance on NYU-v2 with a small number of datapoints per class (DPC). The approach offers a practical, kernel-free, Fourier-domain distribution-matching paradigm that scales to complex multimodal data, enabling efficient training of large models on resource-constrained hardware.

Abstract

Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training. While successful in unimodal settings, they often fail in multimodal scenarios, where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance, comparable to one trained on the full dataset, setting a new state of the art with an 8.2\% absolute improvement over the previous best method and more than 4$\times$ less condensation time.
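To make the three alignment levels concrete, the following is a minimal NumPy sketch of an empirical characteristic-function distance and of how uni-modal, cross-modal, and joint-modal terms might be assembled from it. It is an illustration under stated assumptions, not the authors' implementation: the function names `cf_distance` and `condensation_terms`, the Gaussian sampling of frequency vectors, the use of concatenation to form pairs, and the random pairing used to build hybrid real-synthetic pairs are all hypothetical choices, and the input arrays stand in for ImageBind embeddings of two modalities.

```python
import numpy as np


def cf_distance(x, y, num_freqs=256, scale=1.0, seed=0):
    """Squared distance between the empirical characteristic functions of
    samples x (n, d) and y (m, d), averaged over random frequency vectors.

    Assumption: frequencies are drawn from an isotropic Gaussian; the paper's
    sampling scheme may differ.
    """
    rng = np.random.default_rng(seed)
    t = rng.normal(scale=scale, size=(num_freqs, x.shape[1]))
    # Empirical CF at frequency t: mean over samples of exp(i * <t, sample>).
    phi_x = np.exp(1j * (x @ t.T)).mean(axis=0)
    phi_y = np.exp(1j * (y @ t.T)).mean(axis=0)
    return float(np.mean(np.abs(phi_x - phi_y) ** 2))


def condensation_terms(real_a, real_b, syn_a, syn_b, seed=0):
    """Return the three alignment terms for two modalities a and b.

    real_a, real_b: (n, d) paired real embeddings.
    syn_a, syn_b:   (m, d) paired synthetic embeddings.
    Assumption: pairs are represented by concatenating the two embeddings.
    """
    rng = np.random.default_rng(seed)
    real_pairs = np.concatenate([real_a, real_b], axis=1)
    syn_pairs = np.concatenate([syn_a, syn_b], axis=1)

    # (i) Uni-modal: match real and synthetic data within each modality.
    uni = cf_distance(real_a, syn_a) + cf_distance(real_b, syn_b)

    # (ii) Cross-modal: hybrid pairs mix a synthetic sample of one modality
    # with a real sample of the other (random pairing is an assumption).
    idx = rng.integers(0, real_a.shape[0], size=syn_a.shape[0])
    hybrid_ab = np.concatenate([syn_a, real_b[idx]], axis=1)
    hybrid_ba = np.concatenate([real_a[idx], syn_b], axis=1)
    cross = cf_distance(real_pairs, hybrid_ab) + cf_distance(real_pairs, hybrid_ba)

    # (iii) Joint-modal: match the joint distribution of real pairs
    # against that of synthetic pairs.
    joint = cf_distance(real_pairs, syn_pairs)

    return {"uni": uni, "cross": cross, "joint": joint}
```

In an actual condensation loop the synthetic embeddings would be optimized by gradient descent on a weighted sum of these terms, which would require an automatic-differentiation framework rather than plain NumPy; the weights and the optimizer are not specified here.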
