ATOM: Attention Mixer for Efficient Dataset Distillation

Samir Khaki, Ahmad Sajedi, Kai Wang, Lucy Z. Liu, Yuri A. Lawryshyn, Konstantinos N. Plataniotis

TL;DR

ATtentiOn Mixer (ATOM) is an efficient dataset distillation framework that fuses spatial localization and channel-wise contextual attention to synthesize compact datasets without bi-level optimization. By extracting attention maps from intermediate features and combining them with a final-layer distribution-matching objective, ATOM achieves superior performance across CIFAR-10/100 and TinyImageNet, particularly at low images-per-class (IPC) settings, while maintaining cross-architecture generalization. The method improves on prior attention-based and feature-matching approaches, supports neural architecture search, and offers a practical, scalable route to efficient dataset distillation. Limitations include the cost of re-distilling when settings change and reduced generalization to transformers, which suggests extending attention mixing to transformers and segmentation tasks.

Abstract

Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy close to that of models trained on the entire original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges, we propose the ATtentiOn Mixer (ATOM) module, which efficiently distills large datasets using a mixture of channel-wise and spatial-wise attention in the feature matching process. Spatial-wise attention guides the learning process based on the consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR-10/100 and TinyImageNet. Notably, our method significantly improves performance in scenarios with a low number of images per class. Furthermore, the improvement carries over to cross-architecture evaluation and to applications such as neural architecture search.
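The two attention types described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the power-sum attention form, the exponent `p`, the L2 normalization, the squared-error matching, and all function names are assumptions chosen for illustration. It only shows how a feature map of shape C x H x W could yield a spatial attention matrix (H x W) and a channel attention vector (length C), and how the batch means of real and synthetic attention might be compared. Plain Python lists are used so the sketch has no dependencies, and the final-layer distribution-matching term is omitted.

```python
import math

def spatial_attention(fmap, p=4):
    """C x H x W nested lists -> H x W matrix: sum of |a|^p over channels."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[sum(abs(fmap[c][i][j]) ** p for c in range(C)) for j in range(W)]
            for i in range(H)]

def channel_attention(fmap, p=4):
    """C x H x W nested lists -> length-C vector: sum of |a|^p over positions."""
    return [sum(abs(v) ** p for row in channel for v in row) for channel in fmap]

def flatten(matrix):
    """Vectorize an H x W matrix row by row."""
    return [v for row in matrix for v in row]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def mean_vectors(vectors):
    """Element-wise mean over a batch of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attention_mixing_loss(real_fmaps, syn_fmaps, p=4):
    """Assumed loss: squared distance between batch-mean normalized attention
    of real and synthetic features, summed over both attention types."""
    attention_fns = (
        lambda f: flatten(spatial_attention(f, p)),  # where the class appears
        lambda f: channel_attention(f, p),           # which filters respond
    )
    loss = 0.0
    for attn in attention_fns:
        real_mean = mean_vectors([l2_normalize(attn(f)) for f in real_fmaps])
        syn_mean = mean_vectors([l2_normalize(attn(f)) for f in syn_fmaps])
        loss += sum((r - s) ** 2 for r, s in zip(real_mean, syn_mean))
    return loss

# Toy feature maps with C=2, H=2, W=2 (values are made up).
fmap_a = [[[1.0, 0.0], [0.0, 0.0]],
          [[0.0, 2.0], [0.0, 0.0]]]
fmap_b = [[[0.0, 0.0], [3.0, 0.0]],
          [[0.0, 0.0], [0.0, 1.0]]]

print(spatial_attention(fmap_a, p=2))  # H x W matrix
print(channel_attention(fmap_a, p=2))  # length-C vector
print(attention_mixing_loss([fmap_a], [fmap_b], p=2))
```

In this sketch the two terms are simply added with equal weight; how the terms are actually weighted, and which layers contribute, is not specified in the text above.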

Paper Structure (16 sections, 4 equations, 7 figures, 7 tables, 1 algorithm)


Figures (7)

  • Figure 1: The ATOM Framework utilizes inherent information to capture both context and location, resulting in significantly improved performance in dataset distillation. We display the performance of various components within the ATOM framework, showcasing a $5.8\%$ enhancement from the base distribution matching performance on CIFAR10 at IPC50. Complete numerical details can be found in \ref{tab:component}.
  • Figure 2: (a) An overview of the proposed ATOM framework. By mixing attention, ATOM is able to capture both spatial localization and class context. (b) Demonstration of the internal architecture for spatial- and channel-wise attention in the ATOM Module. The spatial-wise attention computes attention at specific locales through different filters, resulting in a matrix output, whereas the channel-wise attention calculates attention between each filter, naturally producing a vectorized output.
  • Figure 3: Test accuracy evolution of synthetic image learning on CIFAR10 with IPC50 for ATOM (ours), DM \cite{zhao2023dataset}, and DataDAM \cite{sajedi2023datadam}.
  • Figure 4: Sample learned synthetic images for CIFAR-10/100 (32$\times$32 resolution) IPC10 and TinyImageNet (64$\times$64 resolution) IPC 1.
  • Figure 5: Distilled Image Visualization: CIFAR-10 dataset with IPC 50.
  • ...and 2 more figures