Subnet-Aware Dynamic Supernet Training for Neural Architecture Search

Jeimin Jeon, Youngmin Oh, Junghyup Lee, Donghyeon Baek, Dohyung Kim, Chanho Eom, Bumsub Ham

TL;DR

This work tackles two core problems in N-shot NAS: a training bias toward low-complexity subnets that leaves high-complexity subnets undertrained (unfairness), and noisy momentum caused by sharing one optimizer across all subnets. It introduces CaLR, a complexity-aware LR scheduler, and MS, a momentum separation strategy that clusters subnets by structure and keeps a separate momentum buffer per cluster, to stabilize training and improve subnet ranking. Across the NAS-Bench-201 and MobileNet search spaces on CIFAR-10/100 and ImageNet, CaLR+MS consistently improves Kendall's Tau ranking consistency and the accuracy of retrieved subnets with negligible overhead, and it works as a plug-in for SPOS, FairNAS, and FSNAS.

Abstract

N-shot neural architecture search (NAS) exploits a supernet containing all candidate subnets for a given search space. The subnets are typically trained with a static training strategy (e.g., using the same learning rate (LR) scheduler and optimizer for all subnets). This, however, does not consider that individual subnets have distinct characteristics, leading to two problems: (1) the supernet training is biased towards the low-complexity subnets (unfairness); (2) the momentum update in the supernet is noisy (noisy momentum). We present a dynamic supernet training technique that addresses these problems by adapting the training strategy to each subnet. Specifically, we introduce a complexity-aware LR scheduler (CaLR) that adapts the decay ratio of the LR to the complexity of each subnet, which alleviates the unfairness problem. We also present a momentum separation technique (MS). It groups subnets with similar structural characteristics and uses a separate momentum for each group, avoiding the noisy momentum problem. Our approach is applicable to various N-shot NAS methods at marginal cost, while substantially improving search performance. We validate the effectiveness of our approach on various search spaces (e.g., NAS-Bench-201, MobileNet spaces) and datasets (e.g., CIFAR-10/100, ImageNet).
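To make the CaLR idea concrete, the following is a minimal sketch of a complexity-dependent LR schedule. It is not the paper's formulation: the polynomial decay form, the linear mapping from complexity to decay ratio, and the names `decay_ratio`, `calr_lr`, `gamma_min`, and `gamma_max` are all assumptions chosen for illustration. The only property taken from the text is that a more complex subnet receives a smaller decay ratio and therefore a larger LR during training.

```python
def decay_ratio(complexity, c_min, c_max, gamma_min=0.5, gamma_max=2.0):
    """Map a complexity score to a decay ratio (hypothetical linear mapping).

    Higher complexity gives a smaller ratio, i.e. a slower LR decay.
    """
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")
    # Normalize the score to [0, 1] and clamp out-of-range values.
    z = (complexity - c_min) / (c_max - c_min)
    z = min(1.0, max(0.0, z))
    return gamma_max - z * (gamma_max - gamma_min)


def calr_lr(step, total_steps, base_lr, gamma):
    """Polynomial LR decay whose exponent is the decay ratio (assumed form)."""
    progress = min(1.0, max(0.0, step / total_steps))
    return base_lr * (1.0 - progress) ** gamma


# Two hypothetical subnets at the extremes of the complexity range.
gamma_small_net = decay_ratio(complexity=10, c_min=10, c_max=100)
gamma_large_net = decay_ratio(complexity=100, c_min=10, c_max=100)

# Halfway through training, the high-complexity subnet keeps a larger LR.
lr_small_net = calr_lr(step=50, total_steps=100, base_lr=0.1, gamma=gamma_small_net)
lr_large_net = calr_lr(step=50, total_steps=100, base_lr=0.1, gamma=gamma_large_net)
```

In a sampling-based supernet trainer, the LR would be recomputed for whichever subnet is sampled at each iteration, so the schedule stays a per-step lookup rather than a separate optimizer per subnet.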

Paper Structure

This paper contains 43 sections, 11 equations, 12 figures, 19 tables.

Figures (12)

  • Figure 1: Illustrations of the challenges of N-shot NAS methods. (a) We visualize validation losses for the subnets having different complexities at training time. Existing methods do not consider the distinct optimization speed of subnets w.r.t. complexities. This causes an unfairness problem, where the high-complexity subnet is trained insufficiently, and the predicted performance falls behind the low-complexity one, even if it might be supposed to provide better performance. (b) We illustrate gradients $g^t$ of subnets and the momentum $\mu^t$ at $t$-th iteration. We can see that the gradients vary according to the subnets, resulting in a noisy momentum and preventing a stable training process. (Best viewed in color.)
  • Figure 2: Empirical comparisons of SPOS [guo2020single] and SPOS with our dynamic training strategy. We train a supernet using the NAS-Bench-201 search space [dong2020bench] on CIFAR-10 [krizhevsky2009cifar]. (a-b) Validation accuracies for three subnets sampled from the supernet using SPOS [guo2020single] without and with CaLR. Note that the sampled subnets have different complexities and ground-truth accuracies. (c) Plots of gradient consistency in terms of the standard deviation of gradients [lu2023pa]. A smaller value indicates a more consistent gradient direction over the training iterations. (d) Comparisons of various methods in terms of ranking consistency of the supernet, using Kendall's Tau [kendall1948rank].
  • Figure 3: (a) Plots of LRs produced by CaLR for different values of the decay ratio $\gamma(\alpha)$. CaLR sets a small decay ratio (i.e., a large LR) for high-complexity networks, and vice versa. (b) Visualization of the decay ratio $\gamma(\alpha)$ as a function of the complexity score $C(\alpha)$.
  • Figure 4: (a) The supernet shares weights and momentum for all subnets. (b) MS selects a single edge (or layer) from the supernet and clusters the subnets according to operations for the edge. It then assigns a distinct momentum buffer for each cluster, while the weights are shared for all clusters.
  • Figure 5: Plots of standard deviations for gradients and momentums of the supernet on CIFAR-10 [krizhevsky2009cifar] in NAS-Bench-201 [dong2020bench].
  • ...and 7 more figures
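
The mechanism described in the Figure 4 caption (shared weights, one momentum buffer per cluster, clusters defined by the operation on a single chosen edge) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the class `MomentumSeparatedSGD`, the helper `cluster_of`, the tuple encoding of a subnet, and the operation names are hypothetical, and the update rule is plain SGD with momentum.

```python
class MomentumSeparatedSGD:
    """SGD with momentum that keeps one momentum buffer per subnet cluster.

    Weights are shared by all subnets; only the momentum is separated.
    """

    def __init__(self, weights, lr=0.1, momentum=0.9):
        self.weights = list(weights)  # shared across all clusters
        self.lr = lr
        self.momentum = momentum
        self.buffers = {}  # cluster key -> momentum buffer

    def step(self, grads, cluster_key):
        # Create the cluster's buffer lazily on first use.
        buf = self.buffers.setdefault(cluster_key, [0.0] * len(self.weights))
        for i, g in enumerate(grads):
            buf[i] = self.momentum * buf[i] + g
            self.weights[i] -= self.lr * buf[i]


def cluster_of(subnet, edge_index):
    """Cluster key: the operation a subnet uses on the chosen edge."""
    return subnet[edge_index]


# Hypothetical subnets encoded as one operation name per edge.
subnet_a = ("conv3x3", "skip", "conv1x1")
subnet_b = ("skip", "conv3x3", "conv1x1")
chosen_edge = 0  # assumed choice of the edge used for clustering

opt = MomentumSeparatedSGD([0.0, 0.0], lr=0.1, momentum=0.9)
opt.step([1.0, 0.0], cluster_of(subnet_a, chosen_edge))
opt.step([-1.0, 0.0], cluster_of(subnet_b, chosen_edge))
opt.step([1.0, 0.0], cluster_of(subnet_a, chosen_edge))
```

With a single shared buffer, the opposing gradient from `subnet_b` would partly cancel the momentum accumulated for `subnet_a`; with separated buffers, each cluster accumulates only its own gradient history while the weight update still lands on the shared parameters.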