The devil is in discretization discrepancy. Robustifying Differentiable NAS with Single-Stage Searching Protocol

Konstanty Subbotko, Wojciech Jablonski, Piotr Bilinski

TL;DR

The paper addresses discretization discrepancy and entropy-regularization challenges in differentiable NAS by proposing a fully proxyless, single-stage searching protocol that freezes the architecture and reuses weights to bypass decoding and retraining. The approach demonstrates strong Cityscapes results, achieving 75.3% mIoU in the searching stage and surpassing DCNAS on non-dense search spaces, with a total training budget of about 5.5 GPU days. It also reveals limitations of entropy-based regularization and introduces a dataset-split strategy to prevent architecture degeneration in DARTS, while validating the method's efficiency and robustness. Overall, the work offers a practical, low-cost NAS paradigm that preserves performance while reducing computational overhead and improving stability, with potential for richer search spaces including long-range connections.

Abstract

Neural Architecture Search (NAS) has been widely adopted to design neural networks for various computer vision tasks. One of its most promising subdomains is differentiable NAS (DNAS), where the optimal architecture is found in a differentiable manner. However, gradient-based methods suffer from discretization error, which can severely damage the process of obtaining the final architecture. In our work, we first study the risk of discretization error and show how it affects an unregularized supernet. We then show that penalizing high entropy, a common technique of architecture regularization, can hinder the supernet's performance. Therefore, to robustify the DNAS framework, we introduce a novel single-stage searching protocol that does not rely on decoding a continuous architecture. Our results demonstrate that this approach outperforms other DNAS methods, achieving 75.3% in the searching stage on the Cityscapes validation dataset and performing 1.1% better than the optimal network of DCNAS on the non-dense search space comprising short connections. The entire training process takes only 5.5 GPU days thanks to weight reuse, and yields a computationally efficient architecture. Additionally, we propose a new dataset split procedure, which substantially improves results and prevents architecture degeneration in DARTS.
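
To make the two quantities in the abstract concrete, the following is a minimal sketch of entropy and discretization error on a single supernet edge. It is not the authors' implementation: it assumes softmax-normalized architectural parameters, argmax decoding, and scalar stand-ins for operation outputs, and the names `softmax`, `entropy`, `mixed_output`, `discretized_output`, and `op_outputs` are hypothetical.

```python
import math


def softmax(alphas):
    """Normalize raw architectural parameters into mixing weights."""
    shift = max(alphas)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of the mixing weights on one edge."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mixed_output(probs, op_outputs):
    """Continuous supernet edge: weighted sum of candidate operations."""
    return sum(p * o for p, o in zip(probs, op_outputs))


def discretized_output(probs, op_outputs):
    """Decoded edge: keep only the operation with the largest weight."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return op_outputs[best]


# Hypothetical scalar outputs of three candidate operations (toy values).
op_outputs = [1.0, 2.0, 4.0]

flat = softmax([0.0, 0.0, 0.1])    # high-entropy edge, nearly uniform
peaked = softmax([0.0, 0.0, 8.0])  # low-entropy edge, nearly one-hot

gap_flat = abs(mixed_output(flat, op_outputs) - discretized_output(flat, op_outputs))
gap_peaked = abs(mixed_output(peaked, op_outputs) - discretized_output(peaked, op_outputs))

print(f"flat:   entropy={entropy(flat):.3f}, discretization gap={gap_flat:.3f}")
print(f"peaked: entropy={entropy(peaked):.3f}, discretization gap={gap_peaked:.3f}")
```

In this toy setting the nearly uniform edge changes its output far more under decoding than the peaked one, which is the motivation for entropy penalties. A protocol that keeps the continuous weights in the final network, as the abstract describes, never performs the decoding step at all.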

Paper Structure (13 sections, 4 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Illustration of the single-stage searching protocol. We replace both the decoding and the retraining stages with a new fine-tuning phase, during which the architecture is frozen. By reusing weights, we save a considerable amount of retraining time. We keep the optimized architectural parameters in the final network, which means that edges in the supernet take on real values, unlike in the standard DNAS framework.
  • Figure 2: Visualization of the discretization error across different entropy loss magnitudes. For more details, see \ref{subsection:discretization_error}.
  • Figure 3: The average entropy of architectural parameters throughout training. Dashed and solid curves correspond to supernets trained with the constant and the linear entropy scaling function, respectively, as described in \ref{subsec:entropy_loss}. Curves denoted by -, M, and H refer to supernets trained without entropy loss, with medium entropy loss, and with high entropy loss, respectively.
  • Figure 4: (left) dominant eigenvalues of $\nabla_{\alpha}^2\mathcal{L}_{valid}$ on four different search spaces with a dataset split; (right) dominant eigenvalues when searching on a single dataset. All experiments were conducted on the CIFAR-10 dataset.
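
The dominant eigenvalue reported in Figure 4 can be illustrated with power iteration. The sketch below is an assumption-laden toy, not the procedure used in the paper: it uses a small explicit symmetric matrix (`toy_hessian`, a hypothetical stand-in for $\nabla_{\alpha}^2\mathcal{L}_{valid}$), whereas a real supernet would more plausibly rely on Hessian-vector products instead of forming the matrix.

```python
import math


def mat_vec(matrix, vec):
    """Multiply a dense square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dominant_eigenvalue(matrix, iters=200):
    """Estimate the eigenvalue of largest magnitude by power iteration.

    Assumes a symmetric matrix and a start vector that is not orthogonal
    to the dominant eigenvector.
    """
    vec = [1.0] * len(matrix)
    for _ in range(iters):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0  # the matrix annihilated the vector
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector.
    return sum(h * v for h, v in zip(mat_vec(matrix, vec), vec))


# Hypothetical 2x2 stand-in for the Hessian of the validation loss
# with respect to the architectural parameters (toy values).
toy_hessian = [[4.0, 1.0],
               [1.0, 3.0]]

estimate = dominant_eigenvalue(toy_hessian)
exact = (7.0 + math.sqrt(5.0)) / 2.0  # closed form for this 2x2 matrix

print(f"power iteration: {estimate:.6f}, closed form: {exact:.6f}")
```

Power iteration converges to the eigenvalue of largest magnitude, so for a Hessian with a large negative eigenvalue the estimate would track that one instead of the largest positive curvature.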