Low-Rank Similarity Mining for Multimodal Dataset Distillation
Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, Yong-Lu Li
TL;DR
This work tackles multimodal dataset distillation for image–text pairs, where preserving cross‑modal correspondence is crucial yet difficult because the data have high variance and no inherent categories. It introduces Low-Rank Similarity Mining (LoRS), which jointly distills a ground‑truth similarity matrix and the synthetic data, using the low‑rank factorization $\tilde{S}=\omega I+\frac{\alpha}{r}LR^\top$ to keep memory usage linear in the number of synthetic pairs. The method extends ITC losses to continuous similarity targets (eNCE, BCE, wBCE) so that $\tilde{S}$ can be learned alongside the synthetic data, and motivates this design through false negative mining and flexible contrastive anchors. Empirically, LoRS yields substantial improvements over baselines on Flickr30k and COCO, generalizes across architectures, and adds only minimal overhead, suggesting it can serve as a foundational synthetic data setup for visual‑language distillation.
Abstract
Although dataset distillation has developed rapidly in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, which concurrently distills a ground-truth similarity matrix with image-text pairs and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to existing algorithms, marking a notable contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at https://github.com/silicx/LoRS_Distill.
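To make the low-rank factorization $\tilde{S}=\omega I+\frac{\alpha}{r}LR^\top$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes $L, R \in \mathbb{R}^{n\times r}$ for $n$ synthetic pairs and treats $\omega$ and $\alpha$ as scalars; the function names (`low_rank_similarity`, `similarity_block`, `parameter_counts`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def low_rank_similarity(L, R, omega=1.0, alpha=1.0):
    """Materialize S = omega * I + (alpha / r) * L @ R.T (for inspection only)."""
    n, r = L.shape
    assert R.shape == (n, r), "L and R must share the shape (n, r)"
    return omega * np.eye(n) + (alpha / r) * (L @ R.T)


def similarity_block(L, R, idx, omega=1.0, alpha=1.0):
    """Compute only the rows/columns of S for a mini-batch of pair indices.

    This avoids ever building the full n x n matrix: the cost depends on the
    batch size and the rank r, not on n squared.
    """
    r = L.shape[1]
    return omega * np.eye(len(idx)) + (alpha / r) * (L[idx] @ R[idx].T)


def parameter_counts(n, r):
    """Return (full-matrix parameters, low-rank parameters).

    The low-rank count assumes two n x r factors plus one scalar omega.
    """
    return n * n, 2 * n * r + 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, r = 6, 2
    L = rng.normal(size=(n, r))
    R = rng.normal(size=(n, r))
    idx = np.array([1, 4, 5])
    print(similarity_block(L, R, idx, omega=0.5, alpha=2.0).shape)
    print(parameter_counts(1000, 10))
```

Under these assumptions, storing the factors needs $2nr$ values (plus $\omega$) rather than $n^2$, which is where the linear memory scaling comes from; initializing $L$ or $R$ to zero reduces $\tilde{S}$ to the scaled identity, i.e. the usual one-to-one pairing.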
