OStr-DARTS: Differentiable Neural Architecture Search based on Operation Strength

Le Yang, Ziwei Zheng, Yizeng Han, Shiji Song, Gao Huang, Fan Li

TL;DR

The degeneration issue of DARTS can be effectively addressed by selecting operations with the proposed operation-strength criterion, without any modification of supernet optimization. This indicates that the magnitude-based selection method can be a critical reason for the instability of DARTS.

Abstract

Differentiable architecture search (DARTS) has emerged as a promising technique for effective neural architecture search, and it mainly contains two steps to find the high-performance architecture: First, the DARTS supernet that consists of mixed operations will be optimized via gradient descent. Second, the final architecture will be built by the selected operations that contribute the most to the supernet. Although DARTS improves the efficiency of NAS, it suffers from the well-known degeneration issue which can lead to deteriorating architectures. Existing works mainly attribute the degeneration issue to the failure of its supernet optimization, while little attention has been paid to the selection method. In this paper, we cease to apply the widely-used magnitude-based selection method and propose a novel criterion based on operation strength that estimates the importance of an operation by its effect on the final loss. We show that the degeneration issue can be effectively addressed by using the proposed criterion without any modification of supernet optimization, indicating that the magnitude-based selection method can be a critical reason for the instability of DARTS. The experiments on NAS-Bench-201 and DARTS search spaces show the effectiveness of our method.
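
The two DARTS steps summarized in the abstract can be illustrated with a minimal sketch. It assumes scalar "feature maps" and three hypothetical candidate operations (`skip`, `conv`, `zero`); the names, the toy operations, and the `alphas` values are illustrative assumptions and are not taken from the paper. The sketch shows a mixed operation weighted by $\boldsymbol{\beta}=softmax(\boldsymbol{\alpha})$ and the magnitude-based selection that the paper argues against.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, ops, alphas):
    """Output of one supernet edge: the beta-weighted sum of all candidates."""
    betas = softmax(alphas)
    return sum(b * op(x) for b, op in zip(betas, ops))

def select_by_magnitude(names, alphas):
    """Magnitude-based selection: keep the operation with the largest beta."""
    betas = softmax(alphas)
    best = max(range(len(names)), key=lambda i: betas[i])
    return names[best]

# Hypothetical toy edge (assumption: scalars stand in for feature maps).
names = ["skip", "conv", "zero"]
ops = [lambda x: x, lambda x: 2.0 * x + 1.0, lambda x: 0.0]
alphas = [1.2, 0.7, -0.5]  # illustrative values, not trained parameters

edge_output = mixed_op(3.0, ops, alphas)    # step 1: supernet forward pass
chosen = select_by_magnitude(names, alphas)  # step 2: discretization
print(chosen)
```

Here the selection depends only on the learned weights, not on how the loss changes when an operation is actually chosen, which is the gap the operation-strength criterion is meant to close.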

Paper Structure (30 sections, 1 theorem, 19 equations, 13 figures, 7 tables, 1 algorithm)

Key Result

Proposition 1

Without loss of generality, consider one cell from a simplified search space consisting of two operations: skip connection and conv. Let $m^*$ denote the optimal feature map, which is shared across all edges according to the unrolled estimation view (Greff et al., 2016). Let $o_{conv}(x_e)$ be the output [...] Then the optimal $\beta_{skip}$ and $\beta_{conv}$ minimizing $\mathrm{var}(\overline{\boldsymbol{o}}^e-m^*)$ are [...]

Figures (13)

  • Figure 1: The search procedure of DARTS: (a) The continuous search space, which applies a mixture of candidate operations on each edge. (b) The trained supernet obtained by joint optimization of $\mathbf{w}$ and $\boldsymbol{\alpha}$. (c) The final architecture selected by $\boldsymbol{\beta}$ ($\boldsymbol{\beta}=\mathrm{softmax}(\boldsymbol{\alpha})$).
  • Figure 2: Illustration of the proposed selection criterion (best viewed in color). Given the search space (a), we first obtain the optimized supernet (b) via gradient descent. Importance estimation (c) is then conducted: the operation strength of Op1 (the purple one) is calculated as the change in the final loss when Op1 is selected as the last operation to replace the mixed one. We repeat this procedure for each operation, and Op2 is selected as the target operation even though $\beta_2$ does not have the largest value. We then repeat this procedure for the remaining edges to generate the final architecture (e).
  • Figure 3: The differences between (a) pruning and (b) architecture selection.
  • Figure 4: Importance estimation in (a) our method and (b) the naive implementation as used in network pruning.
  • Figure 5: NAS-Bench-201 search space. (a) The cell architecture. (b) The optimal cell.
  • ...and 8 more figures
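
The edge-by-edge procedure described in the Figure 2 caption can be sketched on a toy supernet. This is only an illustration under stated assumptions: the supernet is a chain of two edges acting on scalars, the candidate operations (`skip`, `scale`, `zero`), the data, and the `ALPHAS` values are hypothetical, and the rule "pick the candidate whose substitution gives the smallest change in loss relative to the mixed supernet" is one plausible reading of the caption. The paper's exact definition of operation strength may differ.

```python
import math

def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on scalar "feature maps".
OPS = {"skip": lambda h: h, "scale": lambda h: 2.0 * h, "zero": lambda h: 0.0}
NAMES = list(OPS)

def forward(x, alphas_per_edge, fixed):
    """Run a chain of edges; fixed[e] is an op name, or None for a mixed edge."""
    h = x
    for alphas, choice in zip(alphas_per_edge, fixed):
        if choice is None:
            betas = softmax(alphas)
            h = sum(b * OPS[n](h) for b, n in zip(betas, NAMES))
        else:
            h = OPS[choice](h)
    return h

def loss(alphas_per_edge, fixed, data):
    """Mean squared error of the (partially discretized) supernet."""
    errs = [(forward(x, alphas_per_edge, fixed) - y) ** 2 for x, y in data]
    return sum(errs) / len(errs)

def select_by_strength(alphas_per_edge, data):
    """Assumed criterion: per edge, keep the op with the smallest loss change."""
    fixed = [None] * len(alphas_per_edge)
    for e in range(len(fixed)):
        base = loss(alphas_per_edge, fixed, data)
        deltas = {}
        for name in NAMES:
            trial = fixed[:e] + [name] + fixed[e + 1:]
            deltas[name] = loss(alphas_per_edge, trial, data) - base
        fixed[e] = min(deltas, key=deltas.get)
    return fixed

def select_by_magnitude(alphas_per_edge):
    """Baseline: keep the op with the largest architecture parameter."""
    return [NAMES[max(range(len(a)), key=lambda i: a[i])] for a in alphas_per_edge]

# Illustrative values: the target is y = 4x, i.e. "scale" applied on both edges.
ALPHAS = [[1.0, 0.5, -1.0], [0.8, 0.6, -1.0]]
DATA = [(1.0, 4.0), (2.0, 8.0), (3.0, 12.0)]

print(select_by_magnitude(ALPHAS))       # ['skip', 'skip']
print(select_by_strength(ALPHAS, DATA))  # ['scale', 'scale']
```

In this toy case the largest weights favor `skip` on both edges, while the loss-based criterion picks `scale`, mirroring the caption's point that the selected operation need not have the largest $\beta$.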

Theorems & Definitions (2)

  • Remark 1
  • Proposition 1