Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation

Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, Joey Tianyi Zhou

TL;DR

This work identifies modality collapse as a key bottleneck in multimodal dataset distillation, arising from the tension between extreme data condensation and cross-modal contrastive supervision. It introduces RepBlend, combining Representation Blending to boost intra-modal diversity with Symmetric Projection Trajectory Matching to harmonize optimization across modalities. Empirical results on Flickr-30K and MS-COCO show substantial retrieval gains and up to a $6.7\times$ speedup, with strong generalization across architectures and modalities. The approach advances efficient, balanced multimodal distillation with practical impact on cross-modal retrieval tasks and beyond.

Abstract

Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and an enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to a $6.7\times$ distillation speedup.
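The two symptoms of modality collapse named in the abstract, over-concentrated intra-modal representations and an enlarged cross-modal gap, can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration and are not taken from the paper: intra-modal concentration is measured as the mean pairwise cosine similarity within one modality, and the modality gap as the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def intra_modal_similarity(emb):
    """Mean cosine similarity over all distinct pairs within one modality.

    Assumed definition: values close to 1 indicate that the embeddings
    are concentrated in a narrow region of the space.
    """
    z = l2_normalize(np.asarray(emb, dtype=float))
    sim = z @ z.T
    n = sim.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    off_diag_sum = sim.sum() - np.trace(sim)
    return float(off_diag_sum / (n * (n - 1)))


def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of normalized image and text embeddings.

    Assumed definition: larger values indicate that the two modalities
    occupy more distant regions of the shared space.
    """
    img_center = l2_normalize(np.asarray(img_emb, dtype=float)).mean(axis=0)
    txt_center = l2_normalize(np.asarray(txt_emb, dtype=float)).mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))
```

Under these assumed definitions, a collapsed distilled set would show a high `intra_modal_similarity` for each modality together with a large `modality_gap`, while a well-spread, aligned set would show the opposite.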

Paper Structure

This paper contains 22 sections, 20 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Multimodal embedding distributions across various distillation methods. We extract image and text embeddings from a finetuned CLIP [radford21a] and project them into a shared representation space using DOSNES [lu2019doubly]. Red triangles and blue circles denote image and text embeddings, respectively. Left: Embeddings from randomly sampled data in the original dataset exhibit a well-spread and modality-aligned distribution. Middle: The distilled dataset generated by a state-of-the-art MDD method (LoRS [xu2024lors]) leads to Modality Collapse, where image and text embeddings are poorly aligned and concentrated in distinct regions. Right: Our method effectively mitigates modality collapse, yielding a distribution that better preserves cross-modal alignment and exhibits greater representational diversity.
  • Figure 2: Left: Increasing intra-modal similarity as distillation progresses. We run optimization for 3000 iterations and track the intra-modal cosine similarity, which increases from 0.512 to 0.522 (red curve). Though small in magnitude, this rise leads to a more than twofold increase in concentration ratio (CR) due to the high dimensionality of the embedding space. Right: Modality collapse undermines the effectiveness of learned soft cross-modal correspondence. The non-matching image-text pairs exhibit nearly uniform similarity scores, forming horizontal and vertical stripes.
  • Figure 3: As the noise level $\lambda$ increases, intra-modal similarity (blue bars) shows a slight decline, while the modality gap (yellow bars) rises markedly. In contrast, our representation blending (RB) leverages in-distribution samples to simultaneously reduce intra-modal similarity and inter-modal gap, effectively mitigating modality collapse during distillation.
  • Figure 4: Current MDD methods adopt asymmetric distillation. Left: The loss on the image side shows much smaller variation than that of the text side, fluctuating mildly around 1.0 without notable reduction. Right: The update norm relative to initialization is significantly lower for the image modality in LoRS (0.69) compared to the text modality (0.90), suggesting insufficient representation transfer. The update norm is computed in the shared representation space for both modalities. After incorporating symmetric matching (SM), both image and text modalities exhibit more balanced and synchronized update dynamics, leading to more effective cross-modal alignment (reduced Gap).
  • Figure 5: Ablation study of Representation Blending (RB) and Symmetric Projection Trajectory Matching (SM) on Flickr-30K with NFNet+BERT.
  • ...and 5 more figures
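
The caption of Figure 4 compares the update norm relative to initialization for the image and text modalities. The caption does not give a formula, so the sketch below assumes one plausible reading: the Frobenius norm of the change in shared-space embeddings, divided by the Frobenius norm of the initial embeddings. The function name and the toy data are hypothetical.

```python
import numpy as np


def relative_update_norm(z_init, z_final):
    """Assumed metric: ||Z_final - Z_init||_F / ||Z_init||_F.

    Both arguments are (n_samples, dim) arrays of embeddings in the
    shared representation space, before and after distillation.
    """
    z_init = np.asarray(z_init, dtype=float)
    z_final = np.asarray(z_final, dtype=float)
    return float(np.linalg.norm(z_final - z_init) / np.linalg.norm(z_init))


# Toy illustration with synthetic data (not results from the paper):
# the "text" embeddings are moved further from their start than the
# "image" embeddings, mimicking an asymmetric update.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(100, 16))
direction = rng.normal(size=(100, 16))
image_update = relative_update_norm(z0, z0 + 0.1 * direction)
text_update = relative_update_norm(z0, z0 + 0.5 * direction)
print(image_update < text_update)
```

A large difference between the two modalities under a metric of this kind would correspond to the asymmetric distillation the caption describes; comparable values would correspond to the balanced dynamics reported after symmetric matching.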