ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation

Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, Linfeng Zhang

TL;DR

ImageBindDC is a multimodal data condensation framework that operates in the unified ImageBind embedding space and uses a Characteristic Function Distance loss to align the distributions of real and synthetic data across all statistical moments. By enforcing uni-modal, cross-modal, and joint-modal distribution consistency, it preserves inter-modal relationships that prior uni-modal condensation methods fail to capture. Experiments on VGGS-10K, AVE, NYU-v2, and Clotho show state-of-the-art performance with reduced condensation time and data requirements; on NYU-v2, a model trained on just 5 condensed datapoints per class performs comparably to one trained on the full dataset. The approach offers a practical, kernel-free, Fourier-domain distribution-matching paradigm that scales to complex multimodal data, enabling efficient training of large models on resource-constrained hardware.

Abstract

Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training. While successful in unimodal settings, they often fail in multimodal scenarios, where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution matching by employing a Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance, comparable to one trained on the full dataset, setting a new state of the art with an 8.2% absolute improvement over the previous best method and more than 4$\times$ less condensation time.
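
For reference, the quantities behind the CF loss can be written out as follows. This is the textbook definition of the characteristic function together with one common form of a CF-based discrepancy. The frequency distribution $p(t)$, the sample counts $n$ and $m$, the embedded samples $r_j$ and $s_j$, and the squared-modulus form are assumptions made for illustration; the paper's exact loss may differ.

$$
\Phi_X(t) = \mathbb{E}_{x \sim X}\left[ e^{i \langle t, x \rangle} \right], \qquad
\hat{\Phi}_{\mathcal{R}}(t) = \frac{1}{n} \sum_{j=1}^{n} e^{i \langle t, r_j \rangle}, \qquad
\hat{\Phi}_{\mathcal{S}}(t) = \frac{1}{m} \sum_{j=1}^{m} e^{i \langle t, s_j \rangle}
$$

$$
\mathrm{CFD}(\mathcal{R}, \mathcal{S}) = \mathbb{E}_{t \sim p(t)} \left[ \left| \hat{\Phi}_{\mathcal{R}}(t) - \hat{\Phi}_{\mathcal{S}}(t) \right|^{2} \right]
$$

A distribution is uniquely determined by its characteristic function, and its moments (when they exist) are given by derivatives of $\Phi_X$ at $t = 0$. Driving the discrepancy to zero over all frequencies therefore aligns every moment at once, which is the sense in which the abstract speaks of infinite moment matching.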

Paper Structure

This paper contains 20 sections, 17 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: A Comparison of Multi-modal Data Condensation Paradigms. (Top) Separate Matching: Conventional methods condense each modality (e.g., vision, audio) independently, often using heuristic metrics like MMD. This preserves uni-modal statistics but critically fails to capture the cross-modal relationships that link the data together. (Bottom) Joint Matching: Our proposed framework, ImageBindDC, performs joint matching of all modalities simultaneously within a unified feature space. By using a principled metric like Characteristic Function Distance (CFD), our approach preserves the complete multi-modal data structure, ensuring the synthesized data is semantically coherent.
  • Figure 2: Overview of the ImageBindDC Framework. Our method condenses multi-modal data by performing principled distribution matching in a unified embedding space. (a) Data Condensation Pipeline: We take real multi-modal data, consisting of vision ($\mathcal{R}_v$) and audio ($\mathcal{R}_a$), and aim to synthesize a much smaller synthetic dataset ($\mathcal{S}_v, \mathcal{S}_a$). Both real and synthetic data are projected into a joint embedding space using the pretrained ImageBind encoder. The core of our method is to optimize the synthetic data such that its distribution in this embedding space matches that of the real data. (b) Characteristic Function Discrepancy (CFD): We use CFD as our distribution matching metric. The empirical Characteristic Function (CF) of a data distribution is calculated, which provides a summary in the Fourier domain (visualized here on the complex plane via polar plots). CFD then measures the discrepancy between the CFs of the real and synthetic embeddings, effectively matching all statistical moments for a precise alignment. (c) Multi-modal Distribution Matching Objective: To ensure comprehensive alignment, our final loss is a sum of three CFD-based objectives: (i) Uni-modal alignment preserves the integrity of each modality by matching real and synthetic data within the same modality (e.g., $\mathcal{R}_v$ vs. $\mathcal{S}_v$). (ii) Cross-modal alignment preserves the semantic relationship between modalities by matching the distribution of hybrid pairs (e.g., real audio + synthetic vision vs. real vision + synthetic audio). (iii) Joint-modal alignment captures the complete data structure by matching the joint distribution of paired real data against paired synthetic data. The total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{uni}} + \mathcal{L}_{\text{cross}} + \mathcal{L}_{\text{joint}}$ guides the synthesis process.
  • Figure 3: DM (MMD) vs. ImageBindDC (CF) Performance Comparison. The figure illustrates the accuracy of the MMD and CF methods under (a) 1 DPC and (b) 10 DPC. In all matching configurations, including Video-only, Audio-only, and combined Video + Audio, ImageBindDC demonstrates superior performance over DM.
  • Figure 4: Ablation on Different Matching Objectives. This figure illustrates the contribution of uni-modal, joint-modal, and cross-modal matching objectives to overall accuracy. Results are presented for both (a) 1 DPC and (b) 10 DPC. The full configuration (ImageBindDC), combining all objectives, yields the best performance.
  • Figure 5: Qualitative comparison of distilled image samples on the NYU-v2 dataset. Images distilled by ImageBindDC demonstrate a superior ability to preserve the core visual coherence of the original data across all categories.
  • ...and 1 more figure
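
The three-part objective described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. It assumes the ImageBind embeddings have already been computed and are passed in as plain arrays, that frequencies are drawn from a standard Gaussian, that a (vision, audio) pair is represented by concatenating the two embeddings, and that hybrid pairs are formed by combining each synthetic sample with a randomly drawn real sample of the other modality. The function names `empirical_cf`, `cfd`, `pair`, and `total_loss` are hypothetical.

```python
import numpy as np


def empirical_cf(x, freqs):
    """Empirical characteristic function of samples x (n, d) at freqs (k, d)."""
    proj = x @ freqs.T                      # (n, k): inner products <t, x>
    return np.exp(1j * proj).mean(axis=0)   # (k,): sample mean of e^{i<t, x>}


def cfd(real, syn, freqs):
    """Mean squared modulus of the gap between the two empirical CFs."""
    gap = empirical_cf(real, freqs) - empirical_cf(syn, freqs)
    return float(np.mean(np.abs(gap) ** 2))


def pair(vision, audio):
    """Represent a (vision, audio) pair by concatenation (an assumption)."""
    return np.concatenate([vision, audio], axis=1)


def total_loss(r_v, r_a, s_v, s_a, rng, num_freqs=256):
    """L_total = L_uni + L_cross + L_joint on precomputed embeddings."""
    n, d = r_v.shape
    m = s_v.shape[0]
    f_single = rng.standard_normal((num_freqs, d))      # one modality
    f_pair = rng.standard_normal((num_freqs, 2 * d))    # paired data

    # (i) Uni-modal: real vs. synthetic within each modality.
    l_uni = cfd(r_v, s_v, f_single) + cfd(r_a, s_a, f_single)

    # (ii) Cross-modal: synthetic vision + real audio vs. real vision +
    # synthetic audio. Assumption: real partners are drawn at random.
    idx = rng.choice(n, size=m, replace=False)          # requires m <= n
    l_cross = cfd(pair(s_v, r_a[idx]), pair(r_v[idx], s_a), f_pair)

    # (iii) Joint-modal: paired real data vs. paired synthetic data.
    l_joint = cfd(pair(r_v, r_a), pair(s_v, s_a), f_pair)

    return l_uni + l_cross + l_joint
```

The sketch only evaluates the loss. Optimizing the synthetic data against it, as the pipeline in Figure 2(a) describes, would need the same computation in a framework with automatic differentiation. Because the empirical CF is an average over samples, the real and synthetic sets may have different sizes.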