Towards Mitigating Architecture Overfitting on Distilled Datasets
Xuyang Zhong, Chen Liu
TL;DR
This work tackles architecture overfitting in dataset distillation, where distilled data crafted for a specific training architecture fails to generalize to other architectures. It introduces smoothing-driven methods that treat the larger test model as an implicit ensemble of sub-networks via DropPath and regularize sub-networks with knowledge distillation from a smaller teacher, complemented by three-phase keep-rate scheduling, improved shortcuts, and stronger augmentation. Across multiple dataset distillation methods and datasets, the approach significantly reduces cross-architecture gaps and even yields superior performance when test networks are larger than the training network. The findings highlight improved transferability of distilled datasets and suggest practical benefits for training with limited real data as well. Overall, the paper advances cross-architecture robustness in dataset distillation through plug-and-play, smoothing-based techniques with broad applicability.
Abstract
Dataset distillation methods have demonstrated remarkable performance for neural networks trained with very limited training data. However, a significant challenge arises in the form of architecture overfitting: a distilled training dataset synthesized with a specific network architecture (i.e., the training network) yields poor performance when used to train other network architectures (i.e., test networks), especially when the test networks have a larger capacity than the training network. This paper introduces a series of approaches to mitigate this issue. Among them, DropPath turns the large model into an implicit ensemble of its sub-networks, and knowledge distillation ensures that each sub-network behaves similarly to the small but well-performing teacher network. These methods, characterized by their smoothing effects, significantly mitigate architecture overfitting. We conduct extensive experiments to demonstrate the effectiveness and generality of our methods across various scenarios involving different tasks and different sizes of distilled data. Furthermore, our approaches achieve comparable or even superior performance when the test network is larger than the training network.
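To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It uses plain Python lists in place of tensors and handles a single sample. The function names (`drop_path`, `residual_block`, `kd_loss`), the identity shortcut, the temperature value, and the T^2 scaling of the distillation loss are all illustrative assumptions taken from the commonly used formulations of DropPath and knowledge distillation; the abstract does not specify these details.

```python
import math
import random


def drop_path(branch_out, keep_rate, training, rng=random):
    """Stochastically drop a residual branch (DropPath).

    Assumption: during training the branch is kept with probability
    `keep_rate` and rescaled by 1 / keep_rate so its expectation is
    unchanged; at test time the branch is always kept.
    """
    if not training or keep_rate >= 1.0:
        return list(branch_out)
    if rng.random() < keep_rate:
        return [v / keep_rate for v in branch_out]
    return [0.0 for _ in branch_out]


def residual_block(x, branch_fn, keep_rate, training, rng=random):
    """Compute x + DropPath(branch_fn(x)); an identity shortcut is assumed."""
    dropped = drop_path(branch_fn(x), keep_rate, training, rng)
    return [xi + bi for xi, bi in zip(x, dropped)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by temperature^2.

    Assumption: this is the standard distillation loss; the small training
    network plays the teacher and the larger test network is the student.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

Each training step samples a different set of dropped branches, so the large network behaves like a randomly chosen sub-network; adding `kd_loss` to the training objective pulls every such sub-network toward the teacher's predictions.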
