Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Junhyeok Choi, Sangwoo Mo, Minwoo Chae

TL;DR

This work proposes a learning-free dataset distillation framework that requires no large-scale training or optimization and achieves state-of-the-art cross-architecture generalization.

Abstract

Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.

Paper Structure (28 sections, 3 equations, 10 figures, 19 tables, 1 algorithm)

Figures (10)

  • Figure 1: Concept. Optimization-based methods are computationally expensive and architecture-dependent. Our learning-free framework is simple, efficient, and generalizable across architectures.
  • Figure 2: Our PDS framework distills multimodal datasets by synthesizing samples from image-text prototypes through three stages: (i) modality-specific clustering of CLIP embeddings, (ii) cross-modal cluster matching via a linear assignment to obtain prototypes, and (iii) image synthesis using an unCLIP decoder guided by image prototype and caption embeddings.
  • Figure 3: Synthesized images. Left (cols. 1-3): images from multimodal dataset distillation baselines, which remain nearly identical to their initialization images. Middle (col. 4): an image generated via CLIP inversion from an image prototype, which is not realistic. Right (cols. 5-7): given a text prototype, we retrieve the most similar caption and show three images: the paired real image, the unCLIP image generated from this caption, and the PDS-generated image. Unlike caption-only unCLIP generation, which strictly follows the caption, PDS produces realistic and semantically enriched images by also conditioning on image prototypes.
  • Figure 4: Histograms of cosine similarity between paired image and text prototypes in the 100-pair setting on Flickr30K, shown with and without filtering. Filtering leads to higher similarity.
  • Figure 5: Histograms of cosine similarity between paired image and text prototypes in the 1000-pair setting on Flickr30K, shown for pairless clusters and clusters with shared pairs. Pairless clusters exhibit lower similarity.
  • ...and 5 more figures
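
The first two stages named in the Figure 2 caption (modality-specific clustering, then cross-modal cluster matching via a linear assignment) can be illustrated with a small sketch. This is not the authors' implementation. Random vectors stand in for CLIP embeddings, the tiny k-means routine and the helper names (`l2_normalize`, `spherical_kmeans`, `match_clusters`) are hypothetical, and the matching objective (maximize the number of original image-text pairs shared by matched clusters) is an assumption suggested by the "shared pairs" wording in the Figure 5 caption. Stage (iii), unCLIP synthesis, is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def spherical_kmeans(x, k, iters=20, seed=0):
    """Tiny k-means on unit vectors (illustrative); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids


def match_clusters(img_labels, txt_labels, k):
    """One-to-one matching of image clusters to text clusters.

    Assumption (not confirmed by the source): the assignment maximizes the
    number of original image-text pairs shared by the matched clusters.
    """
    shared = np.zeros((k, k), dtype=int)
    np.add.at(shared, (img_labels, txt_labels), 1)  # count pairs per cluster pair
    rows, cols = linear_sum_assignment(shared, maximize=True)
    return rows, cols, shared


# Synthetic stand-ins for paired CLIP image/text embeddings.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 5
centers = rng.normal(size=(k, d))
topic = rng.integers(0, k, size=n)
img_emb = l2_normalize(centers[topic] + 0.1 * rng.normal(size=(n, d)))
txt_emb = l2_normalize(centers[topic] + 0.1 * rng.normal(size=(n, d)))

# Stage (i): cluster each modality separately.
img_labels, img_centroids = spherical_kmeans(img_emb, k)
txt_labels, txt_centroids = spherical_kmeans(txt_emb, k, seed=1)

# Stage (ii): linear assignment between the two sets of clusters.
rows, cols, shared = match_clusters(img_labels, txt_labels, k)
img_protos = img_centroids[rows]
txt_protos = txt_centroids[cols]

# Cosine similarity of each matched prototype pair.
pair_sim = np.sum(img_protos * txt_protos, axis=1)
```

In this sketch, `pair_sim` holds one cosine similarity per matched prototype pair, the kind of quantity the histograms in Figures 4 and 5 summarize. A matched pair whose `shared[rows[i], cols[i]]` count is zero would correspond to what the Figure 5 caption calls a pairless cluster, under the assumption stated above.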