Regularizing Differentiable Architecture Search with Smooth Activation

Yanlin Zhou, Mostafa El-Khamy, Kee-Bong Song

TL;DR

This work tackles robustness and generalization gaps in differentiable architecture search by introducing SA-DARTS, a regularization that applies a smooth activation function to the architecture weights $\boldsymbol{\alpha}$ and uses the result as an auxiliary loss. The method mitigates skip dominance and the discretization mismatch between the supernet and the final one-hot architecture, while SAC-DARTS, which combines the regularizer with partial-channel sampling, preserves or improves search efficiency. The authors report state-of-the-art results on NAS-Bench-201, CIFAR/ImageNet classification, and super-resolution, and show that the approach yields smoother loss landscapes and more robust operator ranking. The contribution is a low-overhead regularizer for more reliable neural architecture search across vision tasks.

Abstract

Differentiable Architecture Search (DARTS) is an efficient Neural Architecture Search (NAS) method but suffers from robustness, generalization, and discrepancy issues. Many efforts have been made towards the performance collapse issue caused by skip dominance with various regularization techniques towards operation weights, path weights, noise injection, and super-network redesign. It had become questionable at a certain point if there could exist a better and more elegant way to retract the search to its intended goal -- NAS is a selection problem. In this paper, we undertake a simple but effective approach, named Smooth Activation DARTS (SA-DARTS), to overcome skip dominance and discretization discrepancy challenges. By leveraging a smooth activation function on architecture weights as an auxiliary loss, our SA-DARTS mitigates the unfair advantage of weight-free operations, converging to fanned-out architecture weight values, and can recover the search process from skip-dominance initialization. Through theoretical and empirical analysis, we demonstrate that the SA-DARTS can yield new state-of-the-art (SOTA) results on NAS-Bench-201, classification, and super-resolution. Further, we show that SA-DARTS can help improve the performance of SOTA models with fewer parameters, such as Information Multi-distillation Network on the super-resolution task.

Paper Structure

This paper contains 32 sections, 24 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of SAC-DARTS. In addition to the cross-entropy loss, the auxiliary loss $L_{SA}$ is added as a regularization term. The DARTS-based cell is shown on the left. Within each cell, only a portion of the channels is selected for the mixed operation, while unselected channels are passed directly to the next node. Edge weights are introduced to reduce the stochasticity incurred by channel sampling. At the final child-network discretization step, the top-2 incoming edges with the highest $\alpha$ values are kept.
  • Figure 2: Comparison of how $\beta$ (the softmax of the architecture weights $\alpha$) changes over epochs on the same edge for DARTS with L2 regularization ($\lambda_2 \sum^{|O|}_{k=1} \alpha_k^2$), Beta regularization ($\lambda_\beta \log(\sum^{|O|}_{k=1} e^{\alpha_k})$), and our Smooth Activation regularization. The original DARTS with L2 in (a) suffers from skip dominance, and the top-2 softmax values of Beta-DARTS in (b) are too close to identify a suitable candidate operation, whereas our SA-DARTS in (c) resolves skip dominance and yields dispersed $\beta$ values, alleviating the discretization discrepancy. The search is done on NAS-Bench-201.
  • Figure 3: Comparison of two DARTS regularizers, $L_2 = \lambda_2 \sum^{|O|}_{k=1} \alpha_k^2$ and our $L_{SA}=\dfrac{\lambda}{N} \sum_{i=1}^{N} \dfrac{(1+\nu)\alpha_i+(1-\nu)\alpha_i \mathop{\mathrm{erf}}\nolimits(\mu(1-\nu)\alpha_i)}{2}$, in terms of the mean, median, and standard deviation of $\alpha$ over epochs. To address skip dominance, our SA-DARTS drives $\alpha$ values to large negative numbers, so that their softmax values change relatively less than those of positive $\alpha$ values. The search is done on NAS-Bench-201.
  • Figure 4: Changes in the $\beta$ values of the same edge with L2, Beta, and SA regularization on DARTS, starting from an unfair local optimum that favors the skip connection. We assign the skip connection a higher probability. The first 15 epochs are for warm-up only. Our SAC-DARTS can recover from this unfair disadvantage and escape the local optimum. The search is done on NAS-Bench-201.
  • Figure 5: Visualization of the validation accuracy and loss landscapes with respect to the architecture weights $\alpha$. Compared to the original DARTS, our SA-DARTS smooths the landscape and stabilizes the search process.
  • ...and 6 more figures
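
The regularizers named in the captions of Figures 2 and 3 can be evaluated directly on a vector of architecture weights. The following is a minimal sketch in plain Python, not the authors' implementation: the function names and the default values of `lam`, `mu`, and `nu` are hypothetical, and `N` is assumed to be the number of architecture weights passed in.

```python
import math


def smooth_activation_loss(alphas, lam=1.0, mu=1.0, nu=0.25):
    """L_SA from the Figure 3 caption:
    (lam / N) * sum_i [(1+nu)*a_i + (1-nu)*a_i*erf(mu*(1-nu)*a_i)] / 2.
    The defaults for lam, mu, nu are illustrative assumptions."""
    total = 0.0
    for a in alphas:
        total += ((1 + nu) * a + (1 - nu) * a * math.erf(mu * (1 - nu) * a)) / 2
    return lam * total / len(alphas)


def l2_loss(alphas, lam=1.0):
    """L2 regularizer from the captions: lam * sum_k a_k^2."""
    return lam * sum(a * a for a in alphas)


def beta_loss(alphas, lam=1.0):
    """Beta regularizer from the Figure 2 caption: lam * log(sum_k exp(a_k)).
    The maximum is subtracted first for numerical stability."""
    m = max(alphas)
    return lam * (m + math.log(sum(math.exp(a - m) for a in alphas)))


def softmax(alphas):
    """beta values: the softmax of the architecture weights on one edge."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    s = sum(exps)
    return [e / s for e in exps]


if __name__ == "__main__":
    # Hypothetical architecture weights for the candidate operations on one edge.
    alphas = [0.8, -0.3, -1.5, 0.1]
    print("beta:", softmax(alphas))
    print("L_SA:", smooth_activation_loss(alphas))
    print("L_2: ", l2_loss(alphas))
    print("Beta:", beta_loss(alphas))
```

For $\mu > 0$ and $\nu < 1$, each summand of $L_{SA}$ approaches $\alpha_i$ when $\alpha_i$ is large and positive (since $\mathrm{erf} \to 1$) and $\nu\alpha_i$ when $\alpha_i$ is large and negative (since $\mathrm{erf} \to -1$), so the term behaves like a smooth leaky-ReLU-shaped function of each weight.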

Theorems & Definitions (2)

  • Claim 1
  • Claim 2