The devil is in discretization discrepancy. Robustifying Differentiable NAS with Single-Stage Searching Protocol
Konstanty Subbotko, Wojciech Jablonski, Piotr Bilinski
TL;DR
The paper addresses discretization discrepancy and the drawbacks of entropy regularization in differentiable NAS by proposing a fully proxyless, single-stage searching protocol that freezes the architecture and reuses weights, so no decoding or retraining is needed. On Cityscapes, the approach reaches 75.3% mIoU in the searching stage and exceeds the best DCNAS network by 1.1% on a non-dense search space of short connections, with a total training budget of about 5.5 GPU days. The paper also shows that entropy-based regularization can hinder supernet performance and introduces a dataset-split strategy that prevents architecture degeneration in DARTS. Overall, the work offers a practical, low-cost NAS paradigm that preserves performance while reducing computational overhead and improving stability, with potential for richer search spaces including long-range connections.
Abstract
Neural Architecture Search (NAS) has been widely adopted to design neural networks for various computer vision tasks. One of its most promising subdomains is differentiable NAS (DNAS), where the optimal architecture is found in a differentiable manner. However, gradient-based methods suffer from discretization error, which can severely damage the process of obtaining the final architecture. In our work, we first study the risk of discretization error and show how it affects an unregularized supernet. Then, we show that penalizing high entropy, a common technique of architecture regularization, can hinder the supernet's performance. Therefore, to robustify the DNAS framework, we introduce a novel single-stage searching protocol, which does not rely on decoding a continuous architecture. Our results demonstrate that this approach outperforms other DNAS methods, achieving 75.3% mIoU in the searching stage on the Cityscapes validation dataset and performing 1.1% higher than the optimal network of DCNAS on the non-dense search space comprising short connections. Thanks to weight reuse, the entire training process takes only 5.5 GPU days and yields a computationally efficient architecture. Additionally, we propose a new dataset split procedure, which substantially improves results and prevents architecture degeneration in DARTS.
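To make the notion of discretization error concrete, the following toy sketch compares the output of a softmax-weighted mixture of candidate operations with the output of the single operation kept after an argmax decoding step. It is only an illustration under stated assumptions, not the paper's search space or protocol: the three scalar operations in `CANDIDATE_OPS`, the function names, and the example logits are all hypothetical. The sketch also computes the entropy of the architecture weights, the quantity that entropy-based regularization penalizes.

```python
import math

# Hypothetical candidate operations on a scalar input (illustration only).
CANDIDATE_OPS = [
    lambda x: 0.0,       # "zero" operation
    lambda x: x,         # identity / skip connection
    lambda x: 2.0 * x,   # stand-in for a learned operation
]

def softmax(logits):
    """Numerically stable softmax over a list of architecture logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mixed_output(alphas, x):
    """Continuous relaxation: softmax-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS))

def discretized_output(alphas, x):
    """Decoding step: keep only the op with the largest architecture logit."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return CANDIDATE_OPS[best](x)

def discretization_gap(alphas, x):
    """Absolute difference between the relaxed and the decoded output."""
    return abs(mixed_output(alphas, x) - discretized_output(alphas, x))

# Assumed example logits: one near-uniform, one strongly peaked.
near_uniform = [0.1, 0.0, 0.2]
peaked = [0.0, 0.0, 10.0]

for name, alphas in [("near-uniform", near_uniform), ("peaked", peaked)]:
    h = entropy(softmax(alphas))
    gap = discretization_gap(alphas, x=1.0)
    print(f"{name}: entropy={h:.4f}, gap={gap:.4f}")
```

In this toy setting, the high-entropy (near-uniform) weights give a large gap between the relaxed and decoded outputs, while the peaked weights give a small one. This is the usual motivation for penalizing entropy; the abstract argues that such a penalty can itself hurt the supernet, which is why the proposed protocol avoids the decoding step altogether.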
