Architecture-Aware Minimization (A$^2$M): How to Find Flat Minima in Neural Architecture Search

Matteo Gambella, Fabrizio Pittorino, Manuel Roveri

Abstract

Neural Architecture Search (NAS) has become an essential tool for designing effective and efficient neural networks. In this paper, we investigate the geometric properties of neural architecture spaces commonly used in differentiable NAS methods, specifically NAS-Bench-201 and DARTS. By defining flatness metrics such as neighborhoods and loss barriers along paths in architecture space, we reveal locality and flatness characteristics analogous to the well-known properties of neural network loss landscapes in weight space. In particular, we find that highly accurate architectures cluster together in flat regions, while suboptimal architectures remain isolated, unveiling the detailed geometrical structure of the architecture search landscape. Building on these insights, we propose Architecture-Aware Minimization (A$^2$M), a novel analytically derived algorithmic framework that explicitly biases, for the first time, the gradient of differentiable NAS methods towards flat minima in architecture space. A$^2$M consistently improves generalization over state-of-the-art DARTS-based algorithms on benchmark datasets including CIFAR-10, CIFAR-100, and ImageNet16-120, across both NAS-Bench-201 and DARTS search spaces. Notably, averaged across different differentiable NAS methods, A$^2$M increases test accuracy by +3.60\% on CIFAR-10, +4.60\% on CIFAR-100, and +3.64\% on ImageNet16-120, demonstrating its effectiveness in practice. A$^2$M can be easily integrated into existing differentiable NAS frameworks, offering a versatile tool for future research and applications in automated machine learning. We open-source our code at https://github.com/AI-Tech-Research-Lab/AsquaredM.

Paper Structure

This paper contains 35 sections, 29 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Example of a NAS-Bench-201 architecture and one of its radius-1 neighbors. The upper part shows the vector encoding of a NAS-Bench-201 architecture (left) and of its radius-1 neighbor (right). The operations corresponding to the indices of $\mathcal{O}$ are {0: none, 1: skip_connect, 2: nor_conv_1x1, 3: nor_conv_3x3, 4: avg_pool3x3}, with nor_conv referring to the standard (normal) convolution with dense connections. The lower part shows the graph of the NAS-Bench-201 architecture (left) and of one of its radius-1 neighbors (right). The neighbor architecture is obtained by adding a new connection, with a randomly chosen admissible operation, from node 1 to node 2. This corresponds to updating the third element of the vector encoding.
  • Figure 2: Example of a DARTS architecture and one of its radius-1 neighbors. The upper part shows the adjacency matrices of the DARTS architecture (left) and of its radius-1 neighbor (right). The operations corresponding to the indices of $\mathcal{O}$ are {0: none, 1: max_pool3x3, 2: avg_pool3x3, 3: skip_connect, 4: sep_conv_3x3, 5: sep_conv5x5, 6: dil_conv_3x3, 7: dil_conv5x5}, with dil_conv referring to dilated convolution and sep_conv referring to separable convolution. The two lower graphs represent the DARTS architecture (top) and its radius-1 neighbor (bottom), corresponding to their adjacency matrices. The neighbor architecture is obtained by removing the connection from intermediate node 0 to node 1 and adding a new connection from the input node $c_{k-2}$ to intermediate node 1. This corresponds to updating the elements in positions $(1,1)$ and $(1,2)$ of the matrix.
  • Figure 3: (Top) Visualization of a neighbor tree illustrating the neighborhood relationships for sequences of length 3 over a set of 3 operations, with the initial configuration $[0,1,2]$ as root. (Bottom) Visualization of a path tree for sequences of length 3 over a set of 3 operations, from $[0,1,2]$ to $[1,2,1]$, i.e., between two architectures at maximal radius from each other. Duplicates are counted once (see the tree on the right of the arrow).
  • Figure 4: Density histograms of test accuracies over radius-1 neighborhoods on NAS-Bench-201 for CIFAR-10 (left), CIFAR-100 (middle), and ImageNet16-120 (right), for different accuracy ranges of reference architectures in the search space. The red shaded area marks the range of test accuracies of the reference architectures (also reported in each subplot title). For each dataset, three accuracy ranges for the reference architectures were identified (corresponding to high, medium, and low performance), according to the difficulty of the dataset. The green distribution refers to the accuracies over the whole search space, while the blue one refers to the accuracies over the radius-1 neighborhoods only. The blue histograms reveal that radius-1 neighbors tend to have accuracies similar to those of their reference architectures, independently of the evident bias of the search space towards well-performing architectures revealed by the green histograms (see Appendix \ref{statanalysis} for Kolmogorov-Smirnov tests quantitatively confirming this observation). Architectures with similar accuracies tend to cluster together geometrically, and there exist flat basins of architectures in the accuracy landscape.
  • Figure 5: Distributions of the differences in test accuracy between random models in the search space, and between reference architectures and their radius-1 neighborhoods, on NAS-Bench-201 for CIFAR-10 (left), CIFAR-100 (middle), and ImageNet16-120 (right), for different accuracy ranges of reference architectures in the search space. For each dataset, three accuracy ranges for the reference architectures were identified (corresponding to high, medium, and low performance), according to the difficulty of the dataset. The blue distribution refers to the random models, while the others refer to the reference architectures and their neighborhoods. The higher density of the orange distribution on CIFAR-10 reveals the clustering property, namely that models with similar accuracies tend to cluster together geometrically.
  • ...and 8 more figures
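
As a rough illustration of the neighborhood and path notions described in the captions of Figures 1 and 3, the following Python sketch enumerates the radius-1 neighbors of a vector-encoded architecture and the shortest paths between two encodings. It assumes that the radius is the Hamming distance between encodings (one changed element per step), which is consistent with the Figure 1 description but does not cover the adjacency-matrix encoding of Figure 2. The function names `radius1_neighbors`, `hamming_distance`, and `shortest_paths` are hypothetical and are not taken from the authors' code.

```python
from itertools import permutations


def radius1_neighbors(arch, num_ops):
    """Return all encodings that differ from `arch` in exactly one position.

    Assumption: an architecture is a tuple of operation indices in
    range(num_ops), and radius 1 means Hamming distance 1.
    """
    neighbors = []
    for i, op in enumerate(arch):
        for new_op in range(num_ops):
            if new_op != op:
                neighbors.append(arch[:i] + (new_op,) + arch[i + 1:])
    return neighbors


def hamming_distance(a, b):
    """Number of positions at which two equal-length encodings differ."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shortest_paths(src, dst):
    """Enumerate shortest paths from `src` to `dst`.

    Each step sets one differing position directly to its target value,
    so every path has hamming_distance(src, dst) steps.
    """
    differing = [i for i in range(len(src)) if src[i] != dst[i]]
    paths = []
    for order in permutations(differing):
        current = list(src)
        path = [tuple(current)]
        for i in order:
            current[i] = dst[i]
            path.append(tuple(current))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Toy setting from the Figure 3 caption: length-3 sequences, 3 operations.
    src, dst = (0, 1, 2), (1, 2, 1)
    print(radius1_neighbors(src, num_ops=3))
    print(hamming_distance(src, dst))
    paths = shortest_paths(src, dst)
    # Merge duplicate encodings that appear on more than one path.
    print(len(paths), len({arch for path in paths for arch in path}))
```

In the toy setting of the Figure 3 caption, this sketch gives 6 radius-1 neighbors of `(0, 1, 2)` and 6 shortest paths to `(1, 2, 1)`, which together visit 8 distinct encodings once duplicates are merged. Whether this coincides exactly with the tree construction used in the paper is an assumption.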