Low-Rank Similarity Mining for Multimodal Dataset Distillation

Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, Yong-Lu Li

TL;DR

This work tackles multimodal dataset distillation for image–text pairs, where preserving cross‑modal correspondence is crucial yet difficult because the data have high variance and no inherent categories. It introduces Low‑Rank Similarity Mining (LoRS), which jointly distills a ground‑truth similarity matrix and the synthetic data, using the low‑rank factorization $\tilde{S}=\omega I+\frac{\alpha}{r}LR^\top$ (with rank‑$r$ factors $L$ and $R$) so that memory usage stays linear in data size. The method extends ITC losses to continuous similarity targets (eNCE, BCE, wBCE) so that $\tilde{S}$ can be learned alongside the synthetic data, and motivates this through false‑negative mining and flexible contrastive anchors. Empirically, LoRS substantially improves over baselines on Flickr30k and COCO, generalizes across architectures, and adds only minimal overhead, suggesting it can serve as a foundational synthetic data setup for visual‑language distillation.
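
To make the factorization concrete, here is a minimal NumPy sketch that builds $\tilde{S}=\omega I+\frac{\alpha}{r}LR^\top$ from two $n\times r$ factors and uses it as a continuous target in a binary cross-entropy. It is an illustration under assumptions, not the authors' implementation: $\omega$ and $\alpha$ are treated as plain scalars, the function names `low_rank_similarity` and `soft_target_bce` are hypothetical, and the loss is a generic BCE on logits that may differ from the paper's BCE and wBCE definitions.

```python
import numpy as np


def low_rank_similarity(omega, alpha, L, R):
    """Return S = omega * I + (alpha / r) * L @ R.T for factors of shape (n, r)."""
    n, r = L.shape
    if R.shape != (n, r):
        raise ValueError("L and R must have the same (n, r) shape")
    return omega * np.eye(n) + (alpha / r) * (L @ R.T)


def soft_target_bce(logits, targets):
    """Mean binary cross-entropy on logits with continuous targets.

    Uses the identity BCE(z, s) = softplus(z) - s * z for numerical stability.
    Generic form (assumption); not necessarily the paper's exact loss.
    """
    return float(np.mean(np.logaddexp(0.0, logits) - targets * logits))


n, r = 6, 2  # toy sizes chosen for illustration
rng = np.random.default_rng(0)
L = rng.normal(size=(n, r))
R = rng.normal(size=(n, r))

S = low_rank_similarity(omega=1.0, alpha=0.5, L=L, R=R)

# Storage: two (n, r) factors instead of a dense (n, n) matrix.
dense_params = n * n
low_rank_params = 2 * n * r

# The learned part L @ R.T has rank at most r.
residual_rank = int(np.linalg.matrix_rank(L @ R.T, tol=1e-8))

# Pairwise logits from toy features, scored against the soft targets S.
feats_img = rng.normal(size=(n, 4))
feats_txt = rng.normal(size=(n, 4))
loss = soft_target_bce(feats_img @ feats_txt.T, S)
```

The factors cost $2nr$ values, which grows linearly in $n$ for fixed $r$, whereas a dense similarity matrix costs $n^2$. Setting $R$ to zero gives $\tilde{S}=\omega I$, which for $\omega=1$ is the fixed one-to-one pairing.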

Abstract

Though dataset distillation has developed rapidly in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization, so distillation should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, which concurrently distills a ground truth similarity matrix with image-text pairs and leverages low-rank factorization for efficiency and scalability. The proposed approach significantly improves existing algorithms, marking a notable contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at https://github.com/silicx/LoRS_Distill.

Paper Structure

This paper contains 33 sections, 3 theorems, 21 equations, 10 figures, 11 tables, 2 algorithms.

Key Result

Proposition 3.1

The gradient of the InfoNCE loss with respect to the image representation $u_n$ is:
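
As a rough illustration, and not the paper's exact statement, the sketch below works this gradient out for the standard one-directional (image-to-text) InfoNCE loss $\mathcal{L}=-\frac{1}{N}\sum_i\log\frac{\exp(u_i^\top v_i/\tau)}{\sum_j\exp(u_i^\top v_j/\tau)}$ and checks it against finite differences. The assumptions are that representations are unnormalized free variables, that $v_j$ denotes the text representations, that $\tau$ is a temperature, and that no symmetric text-to-image term is included; the paper's definition may differ on any of these. Under those assumptions the gradient is $\frac{1}{N\tau}\big(\sum_j P_{nj}v_j - v_n\big)$ with $P_{nj}=\mathrm{softmax}_j(u_n^\top v_j/\tau)$.

```python
import numpy as np


def info_nce_i2t(U, V, tau):
    """Image-to-text InfoNCE: mean over i of -log softmax_j(u_i . v_j / tau)[i]."""
    logits = U @ V.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def grad_u_n(U, V, tau, n):
    """Closed form under the stated assumptions: (sum_j P_nj v_j - v_n) / (N * tau)."""
    z = V @ U[n] / tau
    p = np.exp(z - z.max())
    p /= p.sum()
    return (p @ V - V[n]) / (U.shape[0] * tau)


def numeric_grad_u_n(U, V, tau, n, eps=1e-6):
    """Central finite differences on each coordinate of u_n."""
    g = np.zeros(U.shape[1])
    for k in range(U.shape[1]):
        U_plus, U_minus = U.copy(), U.copy()
        U_plus[n, k] += eps
        U_minus[n, k] -= eps
        g[k] = (info_nce_i2t(U_plus, V, tau) - info_nce_i2t(U_minus, V, tau)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))  # toy image representations
V = rng.normal(size=(5, 4))  # toy text representations
tau = 0.5

analytic = grad_u_n(U, V, tau, n=2)
numeric = numeric_grad_u_n(U, V, tau, n=2)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

In this form, a gradient-descent step moves $u_n$ toward its paired text $v_n$ and away from every other $v_j$ in proportion to $P_{nj}$, which is one way to read the attract-or-repel picture of anchors.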

Figures (10)

  • Figure 1: Vanilla dataset distillation could be adapted to image-text data but is limited by the fixed data pairing ("Baseline"). We propose similarity mining which simultaneously distills the ground truth similarity matrix, together with low-rank optimization for a fair data parameter size (LoRS). ($I_i$ = $i^\text{th}$ image, $T_i$ = $i^\text{th}$ text).
  • Figure 2: The image feature variance on different datasets. We adopt CLIP encoders with ResNet or ViT, pretrained on LAION or YFCC datasets. Three datasets at left: image-text datasets; seven at right: classification datasets.
  • Figure 3: The histogram of the similarity value learned by similarity mining. False negatives are deliberately constructed and can be found by the algorithm.
  • Figure 4: (a) Training dynamical system of a representation: it is attracted or repelled by the anchors. (b) If the anchor is flexibly weighted, dynamics could be equivalent to a system that has fewer components, and similarity mining could offer this flexibility.
  • Figure 5: Computation graph of the proposed method LoRS. The green nodes are part of the learnable synthetic dataset.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3