Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation

Haizhong Zheng, Jiachen Sun, Shutong Wu, Bhavya Kailkhura, Zhuoqing Mao, Chaowei Xiao, Atul Prakash

TL;DR

This work addresses data condensation by introducing the Hierarchical Memory Network (HMN), a three-tier data container that stores dataset-, class-, and instance-level features and synthesizes compact training data via data parameterization. HMN uses a single shared decoder and per-class feature extractors to generate synthetic images, enabling efficient information sharing and straightforward instance-level pruning. With batch-based gradient matching, HMN consistently outperforms state-of-the-art baselines across five public datasets under various images-per-class (IPC) budgets, and it also shows gains in cross-architecture transfer and continual learning. The authors further propose over-budget condensation with a double-end pruning strategy guided by area under the margin (AUM) to reduce redundancy, saving storage with minimal overhead.

Abstract

Given a real-world dataset, data condensation (DC) aims to synthesize a small dataset that captures the knowledge of the original dataset, so that models trained on it reach comparable accuracy. Recent works propose to enhance DC with data parameterization, which condenses data into very compact parameterized data containers instead of images. The intuition behind data parameterization is to encode shared features of images once, avoiding additional storage costs. In this paper, we recognize that images share common features in a hierarchical way due to the inherent hierarchical structure of the classification system, which is overlooked by current data parameterization methods. To better align DC with this hierarchical nature and encourage more efficient information sharing inside data containers, we propose a novel data parameterization architecture, Hierarchical Memory Network (HMN). HMN stores condensed data in a three-tier structure, representing the dataset-level, class-level, and instance-level features. Another helpful property of the hierarchical architecture is that HMN naturally keeps generated images largely independent of one another despite the information sharing. This enables instance-level pruning of HMN, which removes redundant information and further improves performance. We evaluate HMN on five public datasets and show that our proposed method outperforms all baselines.

Paper Structure (26 sections, 4 equations, 19 figures, 8 tables, 1 algorithm)

Figures (19)

  • Figure 1: Illustration of data condensation with HMN. Like other data parameterization methods, HMN is a data container using a small storage budget and can generate images for training.
  • Figure 2: Illustration of Hierarchical Memory Network and pruning. HMN consists of three tiers of memories (which are learnable parameters). $f_i$ is the feature extractor for each class. $D$ is a single shared decoder to translate a concatenated memory to a synthetic image, though it is applied on a per-image basis, as shown. When we identify redundant or detrimental images, the corresponding instance-level memories are pruned, as indicated by red boxes, saving storage budget.
  • Figure 3: Rank distribution for different basis vectors in HaBa for CIFAR10 10 IPC. Each column in this figure represents the difficulty rank of images generated using the same basis vector. The color stands for the difficulty rank among all generated images. Green denotes easy-to-learn images, while red indicates hard-to-learn images.
  • Figure 4: Continual learning evaluation on CIFAR10. In the class incremental setting with 2 incoming classes per stage, HMN outperforms existing methods (including DSA, DM and IDC) under different storage budgets.
  • Figure 5: Instance-memory length vs. Accuracy for CIFAR10 HMNs with 1 IPC/10 IPC storage budgets. GIPC refers to the number of generated images per class. The solid and dashed curves represent the accuracy and GIPC, respectively.
  • ...and 14 more figures
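
The three-tier layout in the Figure 2 caption can be illustrated with a small toy container. This is only a sketch under stated assumptions, not the authors' implementation: the class name `HMNSketch`, all dimensions, and the use of random linear maps as stand-ins for the per-class extractors $f_i$ and the shared decoder $D$ are hypothetical, and it assumes each $f_i$ reads the dataset-level memory before the result is concatenated with the class-level and instance-level memories.

```python
import random


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


class HMNSketch:
    """Toy three-tier memory container (illustrative only)."""

    def __init__(self, num_classes, instances_per_class,
                 d_dataset=8, d_class=4, d_inst=2, d_feat=4,
                 image_size=16, seed=0):
        rng = random.Random(seed)

        def rand_vec(n):
            return [rng.gauss(0.0, 1.0) for _ in range(n)]

        def rand_mat(rows, cols):
            return [rand_vec(cols) for _ in range(rows)]

        # Tier 1: one dataset-level memory shared by every image.
        self.dataset_memory = rand_vec(d_dataset)
        # Tier 2: one class-level memory per class.
        self.class_memory = [rand_vec(d_class) for _ in range(num_classes)]
        # Tier 3: one instance-level memory per generated image.
        self.instance_memory = [
            [rand_vec(d_inst) for _ in range(instances_per_class)]
            for _ in range(num_classes)
        ]
        # Assumption: f_i and D are plain linear maps in this sketch.
        self.extractors = [rand_mat(d_feat, d_dataset)
                           for _ in range(num_classes)]
        self.decoder = rand_mat(image_size, d_feat + d_class + d_inst)

    def generate(self, c, j):
        """Decode the j-th image of class c from its concatenated memory."""
        feat = matvec(self.extractors[c], self.dataset_memory)
        concat = feat + self.class_memory[c] + self.instance_memory[c][j]
        return matvec(self.decoder, concat)

    def prune(self, c, j):
        """Drop one instance-level memory; shared tiers are untouched."""
        del self.instance_memory[c][j]

    def num_images(self):
        return sum(len(mems) for mems in self.instance_memory)


hmn = HMNSketch(num_classes=3, instances_per_class=4)
kept = hmn.generate(1, 2)
hmn.prune(1, 0)  # former index 2 of class 1 is now index 1
print(hmn.num_images(), hmn.generate(1, 1) == kept)
```

Because each image depends only on the shared tiers plus its own instance-level memory, deleting one instance-level memory in this sketch leaves every other generated image unchanged, which is the property that makes instance-level pruning straightforward.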