Vision-Language Dataset Distillation

Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky

TL;DR

This work introduces the first vision-language dataset distillation framework that distills multimodal (image-text) data by matching the training trajectories of models trained on full data. It combines bi-trajectory matching with Low-Rank Adaptation (LoRA) to scale distillation to large vision-language models, using a bidirectional contrastive objective to align image-text representations. Empirically, the method substantially outperforms adapted coreset baselines on Flickr30K and COCO and achieves strong retrieval performance with an order of magnitude fewer distilled examples. The results highlight the potential of multimodal trajectory-based distillation for compact, efficient cross-modal learning and set the stage for further exploration of minimal-information requirements in vision-language systems.
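To make the LoRA component concrete, here is a minimal sketch of the generic low-rank parameterization, not the authors' implementation. It assumes the standard LoRA form, in which a frozen weight `W` is adapted as `W + (alpha / r) * B @ A`. The layer sizes, rank `r`, scale `alpha`, and the helper name `lora_effective_weight` are illustrative assumptions. The point is only that the low-rank factors hold far fewer parameters than the full weight, so a trajectory over them is cheaper to store and match.

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=1.0):
    """Return the adapted weight W + (alpha / r) * B @ A (generic LoRA form)."""
    r = A.shape[0]  # rank of the low-rank update
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4                 # hypothetical layer sizes and rank

W = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01      # trainable factor, small random init
B = np.zeros((d_out, r))                   # trainable factor, zero init (adapter starts as a no-op)

W_eff = lora_effective_weight(W, A, B, alpha=8.0)

full_params = W.size                       # parameters in the full weight matrix
lora_params = A.size + B.size              # parameters in the low-rank factors
print(full_params, lora_params)
```

With these sizes the full matrix has 2048 entries while the two factors together have 384, which is the kind of reduction that makes matching trajectories of large encoders tractable.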

Abstract

Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily vision-language datasets. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.
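The contrastive formulation and the recall@1 metric mentioned above can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes a symmetric InfoNCE-style loss over cosine similarities, a temperature of 0.07, and exactly one caption per image (Flickr30K and COCO provide several captions per image, so a real evaluation would count any matching caption as correct). The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch of pairs."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    idx = np.arange(len(logits))                          # pair i matches pair i
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # each image picks its text
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # each text picks its image
    return 0.5 * (i2t + t2i)

def recall_at_1(img_emb, txt_emb):
    """Fraction of images whose top-ranked text is the paired one (one caption per image)."""
    sims = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())
```

Because the loss is defined over pairs rather than class labels, it applies directly to datasets with no discrete classes, which is the property the abstract relies on.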

Paper Structure (24 sections, 4 equations, 13 figures, 14 tables, 1 algorithm)

Figures (13)

  • Figure 1: Dataset Distillation Comparison. (Left) Prior dataset distillation methods [wang2018dataset, cazenavette2022dataset, nguyen2020dataset] are class-specific: they distill the key information for each individual discrete class. (Center) Even the recently developed method of [deng2022remember], which enables information sharing between classes through learned bases, still assumes a discrete set of classes. (Right) In contrast, we set out to distill vision-language datasets with no discrete classes; we do so via a novel method which jointly distills images and texts.
  • Figure 2: Vision-Language Dataset Distillation. Both the image and text encoders are pretrained, and each is followed by a trainable projection layer; the text encoder itself is frozen. We use a contrastive loss to measure the distance between the paired image-text embeddings, which influences the trajectory updates during distillation. The right panel shows how training on the distilled data follows the expert's trajectory, starting from a randomly sampled point on the expert trajectory. The distilled dataset is updated based on the bi-trajectory matching loss between the student and expert parameter trajectories.
  • Figure 3: Before and After Distillation. (Left) The image-text pairs before distillation. (Right) The image-text pairs after 2000 distillation steps. Note that each text shown is the training-set sentence nearest to the corresponding distilled text embedding.
  • Figure 4: Distilled Images, iteration = 7000, lr image = 5000.
  • Figure 5: Original Images, iteration = 0.
  • ...and 8 more figures
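
The bi-trajectory matching loss described in the Figure 2 caption can be sketched as below. This is an assumption-laden illustration, not the paper's equation: it uses the normalized parameter distance from trajectory matching for image classification [cazenavette2022dataset] and simply sums one such term for the image branch and one for the text branch. The dictionary keys, vector sizes, and function names are hypothetical.

```python
import numpy as np

def normalized_match(student_end, expert_start, expert_target, eps=1e-12):
    """Squared distance from student to expert target, divided by the distance the expert moved."""
    num = np.sum((student_end - expert_target) ** 2)
    den = np.sum((expert_start - expert_target) ** 2) + eps
    return num / den

def bi_trajectory_loss(student_end, expert_start, expert_target):
    """Sum of the normalized matching terms for the image and text parameter vectors."""
    return sum(
        normalized_match(student_end[k], expert_start[k], expert_target[k])
        for k in ("image", "text")
    )

rng = np.random.default_rng(0)
# Flattened parameters at a sampled point on the expert trajectory (toy sizes).
start = {"image": rng.normal(size=10), "text": rng.normal(size=6)}
# Expert parameters some steps later on the same trajectory.
target = {k: v + 0.1 * rng.normal(size=v.shape) for k, v in start.items()}

# A student that did not move from the start gets about 1 per branch;
# a student that lands exactly on the expert target gets 0.
loss_no_progress = bi_trajectory_loss(start, start, target)
loss_perfect = bi_trajectory_loss(target, start, target)
```

In an actual distillation loop the student parameters would come from a few training steps on the distilled pairs, and the gradient of this loss would flow back to the distilled images and text embeddings.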