Neural Architecture Search by Learning a Hierarchical Search Space

Mehraveh Javan Roshtkhari, Matthew Toews, Marco Pedersoli

TL;DR

The paper addresses the efficiency of Neural Architecture Search by improving the exploration strategy of Monte-Carlo Tree Search through a learned hierarchical search space. It proposes constructing this hierarchy by clustering architectures based on pairwise distances between their output vectors produced by a partially trained supernet, enabling semantically meaningful early splits in the search tree. Empirically, the method yields state-of-the-art or competitive results on CIFAR10 (Pooling and NAS-Bench-Macro) and ImageNet under constrained computational budgets, without the need for additional regularization. This approach enhances NAS practicality by accelerating convergence and improving final architecture quality through a data-driven tree structure that better guides exploration in the search space.

Abstract

Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search-related problems, such as adversarial games. However, the performance of such an approach depends heavily on the order of the nodes considered at each branching of the tree. If the first branches cannot distinguish between promising and deceptive configurations for the final task, the efficiency of the search is exponentially reduced. In Neural Architecture Search (NAS), since only the final architecture matters, the visiting order of the branches can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance between the output vectors of the architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, can yield promising solutions more efficiently than other approaches for NAS problems.
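The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the average-linkage rule, the function names (`pairwise_distances`, `build_tree`, `leaves`) and the toy output vectors are all assumptions made for illustration, since this summary does not specify them. In the paper the output vectors come from a partially trained supernet.

```python
import itertools
import math


def pairwise_distances(outputs):
    """Euclidean distance between every pair of architecture output vectors."""
    n = len(outputs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(outputs[i], outputs[j])
    return dist


def build_tree(dist):
    """Agglomerative clustering (average linkage, an assumption).

    Returns a binary tree as nested tuples; leaves are architecture indices.
    """
    # Each cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(range(len(clusters)), 2):
            ma, mb = clusters[a][1], clusters[b][1]
            # Average distance over all cross-cluster pairs.
            d = sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


def leaves(tree):
    """Collect the architecture indices under a (sub)tree."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


# Toy output vectors for 8 hypothetical architectures: indices 0-3 behave
# alike, and indices 4-7 behave alike.
outputs = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.00], [0.85, 0.10, 0.05], [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10], [0.20, 0.70, 0.10], [0.10, 0.85, 0.05], [0.15, 0.80, 0.05],
]
tree = build_tree(pairwise_distances(outputs))
print(sorted(leaves(tree[0])), sorted(leaves(tree[1])))
```

In this toy case the root of the tree separates the two groups of similar architectures, so the first branching decision already distinguishes architectures that behave differently, which is the property the abstract argues MCTS needs.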

Paper Structure

This paper contains 25 sections, 4 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Probability factorization of 8 architectures. We show different ways to approximate the discrete probability distribution over architectures for a toy search space with N=3 nodes (a, b, c in the figure), each with O=2 possible operations, for a total of $2^3$ architectures. (left) Assuming the nodes are independent (as in DARTS [liu2018darts]) allows the model to estimate only $N \times O$ probabilities. (center) Considering the joint probabilities would require estimating $O^N$ different probabilities (as in Boltzmann sampling). (right) The joint probability can be factorized into a product of conditional probabilities (in a hierarchy, as in MCTS). This does not reduce the number of probabilities to estimate, but it allows a more efficient exploration of the search space.
  • Figure 2: Comparison of the standard tree structure and our learned structure on a search space of 3 binary operations. (a) The search space consists of architectures with 3 binary operations ($o_a,o_b,o_c$), which leads to 8 architectures ($a_1,a_2,...,a_8$). (b) The default tree structure uses the order of operations (e.g. layers) to build the tree; however, this is not optimal. (c) Our learned tree structure is generated by agglomerative clustering on the model outputs.
  • Figure 3: (left) Training epochs for estimating the similarity matrix. We show the final performance of our MCTS when the tree structure is learned with a model trained uniformly for a given number of epochs. For best results, at least 200 epochs are needed. (right) Accuracy over epochs for several training strategies. After the warm-up phase, our approach is consistently better than the default tree or MCTS with a randomly selected tree.
  • Figure 4: Normalized distance matrices calculated with various methods. (left) Distance matrix calculated from output vectors (our method); (middle) from vector encoding; (right) from one-hot encoding. The architecture indices on the leaves correspond to the indices used in the Pooling benchmark [roshtkhari2023balanced].
  • Figure 5: Tree branching for the Pooling search space by hierarchical clustering. The architecture indices on the leaves correspond to the indices used in the Pooling benchmark [roshtkhari2023balanced]. (left) Tree learned from output vectors (our method); (middle) from vector encoding; (right) from one-hot encoding.
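
The three approximations compared in the Figure 1 caption can be written out for the toy case N=3, O=2. The notation below is an assumption: it borrows the operation symbols $o_a, o_b, o_c$ from the Figure 2 caption, and the ordering a, b, c in the last line is just one possible hierarchy.

$$
\begin{aligned}
\text{independent:}\quad & p(o_a, o_b, o_c) \approx p(o_a)\, p(o_b)\, p(o_c) && (N \times O = 6 \text{ probabilities}) \\
\text{joint:}\quad & p(o_a, o_b, o_c) && (O^N = 8 \text{ probabilities}) \\
\text{hierarchical:}\quad & p(o_a, o_b, o_c) = p(o_a)\, p(o_b \mid o_a)\, p(o_c \mid o_a, o_b) && (O + O^2 + O^3 = 14 \text{ conditionals})
\end{aligned}
$$

The last line is the chain rule, so it is exact rather than an approximation. The count of 14 assumes a full tree with one conditional probability per edge, which is consistent with the caption's remark that the factorization does not reduce the number of probabilities to estimate.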