DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions

Guangrun Wang; Changlin Li; Liuchun Yuan; Jiefeng Peng; Xiaoyu Xian; Xiaodan Liang; Xiaojun Chang; Liang Lin

TL;DR

This work modularizes a large search space into blocks with small search spaces and develops a family of models, namely a DNA family, that are capable of resolving multiple dilemmas of the weight-sharing NAS, such as scalability, efficiency, and multi-modal compatibility.

Abstract

Neural Architecture Search (NAS), which aims to design neural architectures automatically, has been considered a key step toward automatic machine learning. One notable NAS branch is weight-sharing NAS, which significantly improves search efficiency and allows NAS algorithms to run on ordinary computers. Despite high expectations, this category of methods suffers from low search effectiveness. Using a generalization boundedness tool, we demonstrate that the cause of this drawback is untrustworthy architecture rating, which stems from the oversized search space of possible architectures. To address this problem, we modularize a large search space into blocks with small search spaces and develop a family of models with the distilling neural architecture (DNA) techniques. These proposed models, namely a DNA family, are capable of resolving multiple dilemmas of weight-sharing NAS, such as scalability, efficiency, and multi-modal compatibility. Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using heuristic algorithms. Moreover, under a given computational complexity constraint, our method can search for architectures with different depths and widths. Extensive experimental evaluations show that our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively. Additionally, we provide in-depth empirical analysis and insights into neural architecture ratings. Code is available at https://github.com/changlin31/DNA.
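To make the modularization argument concrete, the following is a minimal arithmetic sketch, not the paper's search space. It assumes a hypothetical network of 18 layers with 4 candidate operations per layer, split into 6 blocks of 3 layers each, and it assumes each block can be rated on its own. The function names are illustrative only.

```python
# Hypothetical sizes chosen for illustration; they are not taken from the paper.

def full_space_size(ops_per_layer: int, num_layers: int) -> int:
    """Number of whole architectures when every layer picks one op independently."""
    return ops_per_layer ** num_layers


def blockwise_candidate_count(ops_per_layer: int, layers_per_block: int, num_blocks: int) -> int:
    """Number of block-level candidates to rate when each block is scored separately."""
    return num_blocks * ops_per_layer ** layers_per_block


# 18 layers, 4 ops each: the joint space is 4**18 architectures.
full = full_space_size(ops_per_layer=4, num_layers=18)

# 6 blocks of 3 layers: only 6 * 4**3 block candidates need a rating.
blockwise = blockwise_candidate_count(ops_per_layer=4, layers_per_block=3, num_blocks=6)

print(full, blockwise)
```

Under these assumptions the joint space is far too large to enumerate, while the block-level candidates are few enough to rate exhaustively, which is the intuition behind rating all candidates rather than a heuristically sampled subset.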

Paper Structure (23 sections, 2 theorems, 15 equations, 13 figures, 15 tables, 2 algorithms)

Introduction
Related Work
Methodology
Basic Analysis of Weight-sharing NAS's Dilemma
Modularizing Search Space into Blocks
DNA: Distillation via Supervising Learning
DNA+: Distillation via Progressive Learning
DNA++: Distillation via Self-Supervised Learning
Experiments
Datasets.
Search Spaces and Architecture Details
Searching on MBConv Search Space
DNA
DNA+
Searching for Vision Transformers
...and 8 more sections

Key Result

Theorem 1

(Generalization boundedness). For any subnet $\alpha_j$, let $\psi_j^{sup}$ denote its sub-optimal weights extracted from a trained supernet, and let $\psi_j^{*}$ denote its ideal weights when trained alone. Then the Frobenius norm of $\psi_j^{sup}$ is upper bounded by [bound missing in the source], where the three symbols appearing in the bound (also missing in the source) are constants.

Figures (13)

Figure 1: Illustration of DNA family. (a) Distilling neural architecture technique with block-wise supervision. Architecture candidates (denoted by different nodes and paths) are divided into blocks. (b) Supervised learning (vanilla DNA). (c) Progressive learning (DNA+). (d) Self-supervised learning (DNA++).
Figure 2: Illustration of our DNA. The teacher's preceding feature map is used as input for both teacher and student blocks. Each cell of the supernet is trained independently to mimic the behavior of the corresponding teacher block by minimizing the L2 distance between their output feature maps. The dotted lines indicate randomly sampled paths in a cell. (Best viewed in color)
Figure 3: Illustration of DNA+. In the first generation, we use an existing model as the teacher model. Then, at each consecutive generation, a new teacher is obtained by scaling the searched architecture of the previous generation and retraining the scaled architecture. The finally searched architecture is the optimal student $\alpha^{M*}$ in the last generation, which is retrained without scaling.
Figure 4: Illustration of DNA++. It uses self-supervisions to replace existing supervising teachers, avoiding architecture shifts (referring to a phenomenon that students with a similar architecture to a teacher tend to be favored when the teacher is traditional). DNA++ contains two losses, i.e., a self-supervised loss and a particular loss to remove redundant non-learnable supernets.
Figure 5: Trade-off between model accuracy and model complexity on ImageNet. Left: comparison among our scaled DNA models, EfficientNets, and SCARLET on ImageNet by accuracy vs. parameter numbers. Mid: model accuracy vs. parameter numbers; Right: model accuracy vs. FLOPs.
...and 8 more figures
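The block-wise supervision described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: blocks are plain Python functions on lists of floats, the distance is a mean squared difference (the normalization is an assumption), and all names (`mean_squared_distance`, `blockwise_losses`, the toy blocks) are hypothetical.

```python
from typing import Callable, List, Sequence

# A "block" here is any function mapping a feature vector to a feature vector.
Block = Callable[[Sequence[float]], List[float]]


def mean_squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two feature vectors, averaged over elements."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def blockwise_losses(x: Sequence[float],
                     teacher_blocks: Sequence[Block],
                     student_blocks: Sequence[Block]) -> List[float]:
    """Return one distillation loss per block.

    Each student block receives the teacher's preceding feature map as input,
    so every block is supervised independently of the other student blocks.
    """
    losses = []
    feat = list(x)
    for t_block, s_block in zip(teacher_blocks, student_blocks):
        t_out = t_block(feat)
        s_out = s_block(feat)  # same input as the teacher block
        losses.append(mean_squared_distance(t_out, s_out))
        feat = t_out           # the next block again starts from the teacher's output
    return losses


# Toy teacher: double the features, then add 1.
teacher = [lambda f: [2.0 * v for v in f], lambda f: [v + 1.0 for v in f]]
# Toy student: first block matches the teacher, second block is slightly off.
student = [lambda f: [2.0 * v for v in f], lambda f: [v + 0.5 for v in f]]

losses = blockwise_losses([1.0, 2.0], teacher, student)
print(losses)
```

Because each student block is fed the teacher's feature map rather than the previous student block's output, an error in one block does not propagate into the loss of the next, which is what allows the blocks to be trained and rated separately.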

Theorems & Definitions (3)

Theorem 1
Theorem 2
Proof 1
