
FocusDD: Real-World Scene Infusion for Robust Dataset Distillation

Youbing Hu, Yun Cheng, Olga Saukh, Firat Ozdemir, Anqi Lu, Zhiqiang Cao, Zhijun Li

TL;DR

FocusDD addresses the difficulty of applying dataset distillation to large-scale, high-resolution data with an attention-guided, patch-based distillation pipeline. A pre-trained Vision Transformer selects informative foreground patches, which are combined with contextual background to reconstruct realistic, compact images that support both classification and object detection. The method reports state-of-the-art accuracy on ImageNet-1K across multiple architectures and the first results for training object detectors on distilled data, on COCO2017, with further gains when combined with Dynamic Fine-Tuning. It also generalizes across architectures and reduces computational cost through downsampling.

Abstract

Dataset distillation has emerged as a strategy to compress real-world datasets for efficient training. However, it struggles with large-scale and high-resolution datasets, limiting its practicality. This paper introduces a novel resolution-independent dataset distillation method, Focused Dataset Distillation (FocusDD), which achieves diversity and realism in distilled data by identifying key information patches, thereby ensuring the generalization capability of the distilled dataset across different network architectures. Specifically, FocusDD leverages a pre-trained Vision Transformer (ViT) to extract key image patches, which are then synthesized into a single distilled image. These distilled images, which capture multiple targets, are suitable not only for classification tasks but also for dense tasks such as object detection. To further improve the generalization of the distilled dataset, each synthesized image is augmented with a downsampled view of the original image. Experimental results on the ImageNet-1K dataset demonstrate that, with 100 images per class (IPC), ResNet50 and MobileNet-v2 achieve validation accuracies of 71.0% and 62.6%, respectively, outperforming state-of-the-art methods by 2.8% and 4.7%. Notably, FocusDD is the first method to use distilled datasets for object detection tasks. On the COCO2017 dataset, with an IPC of 50, YOLOv11n and YOLOv11s achieve 24.4% and 32.1% mAP, respectively, further validating the effectiveness of our approach.
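
The patch selection and synthesis steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention scores are random stand-ins for what a pre-trained ViT would provide, all patches come from one image for brevity, and the function names (`top_k_patches`, `downsample`, `tile`), the 2x2 layout, and the 112-pixel patch size are assumptions made only for illustration.

```python
import numpy as np

def top_k_patches(image, scores, patch, k):
    """Return the k patches of `image` with the highest scores.

    image:  (H, W, C) array with H and W divisible by `patch`
    scores: (H // patch, W // patch) array, one score per patch
    """
    cols = scores.shape[1]
    order = np.argsort(scores, axis=None)[::-1][:k]  # flat indices, best first
    out = []
    for idx in order:
        r, c = divmod(int(idx), cols)
        out.append(image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch])
    return out

def downsample(image, size):
    """Average-pool an (H, W, C) image to (size, size, C).

    H and W must be multiples of `size`.
    """
    h, w, ch = image.shape
    fh, fw = h // size, w // size
    return image.reshape(size, fh, size, fw, ch).mean(axis=(1, 3))

def tile(cells, grid):
    """Arrange grid * grid equally sized cells row by row into one image."""
    rows = [np.concatenate(cells[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real training image
scores = rng.random((2, 2))         # stand-in for per-patch ViT attention

patches = top_k_patches(image, scores, patch=112, k=3)  # three key patches
context = downsample(image, 112)                        # downsampled full view
distilled = tile(patches + [context], grid=2)           # one synthesized image
print(distilled.shape)  # (224, 224, 3)
```

In this layout, three cells hold high-attention patches and the fourth holds a downsampled view of the full image, mirroring the abstract's idea of pairing key patches with a view that retains scene context.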

Paper Structure

This paper contains 25 sections, 13 equations, 14 figures, 18 tables.

Figures (14)

  • Figure 1: FocusDD performance on classification and detection tasks. Left: For classification with IPC=100, we use MobileNet-v2 [sandler2018mobilenetv2] and ResNet-18 [he2016deep] as validation models to evaluate the ImageNet-1K [deng2009imagenet] validation set. SCDD [SCDD], SRe$^{2}$L [yin2024squeeze], and RDED [sun2024diversity] are the current SOTA methods. Right: In the detection task, we use YOLOv11 [khanam2024yolov11] as the validation model to evaluate the COCO2017 [lin2014microsoft] validation set. FocusDD is the first method to explore dataset distillation for object detection tasks.
  • Figure 2: Visualization of the FocusDD-distilled images on different tasks. Left: Visualization of training samples for object detection using FocusDD-distilled images. Using YOLOv11x [khanam2024yolov11] as the teacher model, soft supervision is applied to train YOLOv11n and YOLOv11s, tested on the COCO2017 [lin2014microsoft] validation set. The numbers in each image correspond to COCO categories. Right: Visualization of training samples for classification using FocusDD-distilled images. Soft supervision with ResNet-18 [he2016deep] as the teacher guides ResNet-18 and MobileNet-v2 [sandler2018mobilenetv2] training, tested on the ImageNet-1K [deng2009imagenet] validation set. The performance is shown in Fig. 1.
  • Figure 3: Overview of the FocusDD framework. FocusDD comprises two main stages: information extraction and image reconstruction. In the information extraction stage, a pre-trained ViT model guides the selection of key patches and of representative real images with background details. During the image reconstruction stage, these patches are combined with the background-rich images to reconstruct a compiled, realistic image. Subsequently, these images are relabelled using a model with the same architecture as the validation model.
  • Figure 4: The FocusDD process of selecting key image patches. Downsampling greatly reduces the computational cost of dataset distillation (see the computational cost table in the Appendix) and allows the direct use of downsampled images to improve the generalization performance of the synthesized dataset (see Fig. 1).
  • Figure 5: Model accuracy with varying epochs and resolutions. Left: Accuracy changes with training epochs using ResNet-18 as the validation model in the IPC=10 setting. Right: The impact of the image resolution of the synthetic dataset on model accuracy.
  • ...and 9 more figures