Weight-Entanglement Meets Gradient-Based Neural Architecture Search

Rhea Sanjay Sukthanker, Arjun Krishnakumar, Mahmoud Safari, Frank Hutter

TL;DR

This work bridges gradient-based NAS built on weight sharing (WS) and weight-entangled (WE) macro search spaces by introducing TangleNAS, a scheme that uses weight superposition and combi-superposition to enable single-stage NAS in WE spaces. It reduces memory and forward-pass cost while preserving the expressiveness of macro-level architectural choices, and it performs strongly across toy, cell-based, and macro spaces (including AutoFormer, MobileNetV3, and language-model search spaces), often surpassing two-stage baselines. The results show competitive or superior accuracy, improved anytime performance, and lower memory usage, supported by analyses of architecture design choices, of pretraining, fine-tuning, and retraining effects, and of transfer to ImageNet. This approach advances practical NAS by enabling efficient exploration of broad architectural spaces, potentially accelerating the design of scalable transformers and other large models.

Abstract

Weight sharing is a fundamental concept in neural architecture search (NAS), enabling gradient-based methods to explore cell-based architectural spaces significantly faster than traditional black-box approaches. In parallel, weight-entanglement has emerged as a technique for more intricate parameter sharing amongst macro-architectural spaces. Since weight-entanglement is not directly compatible with gradient-based NAS methods, these two paradigms have largely developed independently in parallel sub-communities. This paper aims to bridge the gap between these sub-communities by proposing a novel scheme to adapt gradient-based methods for weight-entangled spaces. This enables us to conduct an in-depth comparative assessment and analysis of the performance of gradient-based NAS in weight-entangled search spaces. Our findings reveal that this integration of weight-entanglement and gradient-based NAS brings forth the various benefits of gradient-based methods, while preserving the memory efficiency of weight-entangled spaces. The code for our work is openly accessible at https://github.com/automl/TangleNAS.

Paper Structure (66 sections, 9 figures, 26 tables, 4 algorithms)

Figures (9)

  • Figure 1: (a) Two-Stage NAS with WE (Algorithm \ref{alg:we}): dotted paths show operation choices not sampled at the given step. (b) Single-Stage NAS with WS (Algorithm \ref{alg:ws}): every operation choice is evaluated independently and contributes to the output feature map with corresponding architecture parameters. (c) Single-Stage NAS with WE (Algorithm \ref{alg:we2}): operation choices superimposed with corresponding architecture parameters. The architecture parameters for the three operation choices are represented by ${[\alpha_i]}_{i=1}^{3}$ and ${[\beta_i]}_{i=1}^{3}$. The operation weights, or choices, are symbolized by cubes (for convolutional layers) or rectangles (for feedforward layers) in various colors. In scenarios (a) and (c), due to weight entanglement, the smaller weights are effectively structured subsets of the larger weights. Conversely, in (b), through weight-sharing, operation weights are maintained independently from each other. In both (b) and (c), to determine the optimal architecture, the operations associated with the highest architecture parameter value are selected. This selection process applies to the choice of kernel size and the output dimension of the feedforward network.
  • Figure 2: Weight superposition with architecture parameters $\{\alpha_i\}_{i=1}^{3}$ for kernel size search. Supernet weight matrix (LHS) is adapted to gradient-based methods (RHS).
  • Figure 3: Combi-superposition with parameters $\alpha_i\beta_j$. Supernet weight matrix (LHS) is adapted to gradient-based methods (RHS).
  • Figure 4: Test accuracy evolution over epochs for NB201.
  • Figure 5: Anytime performance curves of AutoFormer vs. ours.
  • ...and 4 more figures
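
The superposition idea described in the captions of Figures 1 to 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes a 1-D kernel whose smaller candidates are centered slices of the largest one, a softmax over the architecture parameters, and top-left sub-blocks for the two-dimensional (combi) case. All function names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def centered_subkernel(weight, size):
    """Keep the centered `size` taps of the largest kernel and zero the rest."""
    pad = (len(weight) - size) // 2
    return [w if pad <= i < pad + size else 0.0 for i, w in enumerate(weight)]


def superimpose(weight, sizes, alphas):
    """Weight superposition: mix the nested sub-kernels into one kernel."""
    mixed = [0.0] * len(weight)
    for prob, size in zip(softmax(alphas), sizes):
        for i, w in enumerate(centered_subkernel(weight, size)):
            mixed[i] += prob * w
    return mixed


def combi_superimpose(matrix, row_choices, col_choices, alphas, betas):
    """Combi-superposition: weight each (rows, cols) sub-block by alpha_i * beta_j.

    Assumption: every candidate is the top-left sub-block of the full matrix.
    """
    mixed = [[0.0] * len(matrix[0]) for _ in matrix]
    for a, rows in zip(softmax(alphas), row_choices):
        for b, cols in zip(softmax(betas), col_choices):
            for i in range(rows):
                for j in range(cols):
                    mixed[i][j] += a * b * matrix[i][j]
    return mixed


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation for odd-length kernels."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# Kernel-size search with three candidates that share one weight vector.
weight = [0.1, -0.2, 0.5, 0.3, -0.4]
sizes = [1, 3, 5]
alphas = [0.0, 1.0, 2.0]
signal = [1.0, 2.0, 3.0, 4.0]

# One forward pass with the superimposed kernel.
mixed_kernel = superimpose(weight, sizes, alphas)
entangled_out = conv1d_same(signal, mixed_kernel)

# Reference: one forward pass per candidate, mixed afterwards.
probs = softmax(alphas)
per_op = [conv1d_same(signal, centered_subkernel(weight, s)) for s in sizes]
mixture_out = [
    sum(p * out[i] for p, out in zip(probs, per_op))
    for i in range(len(signal))
]

# Discretization: keep the choice with the largest architecture parameter.
selected_size = sizes[alphas.index(max(alphas))]

# Two searched dimensions of a linear layer (output rows, input columns).
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
combi = combi_superimpose(matrix, [1, 2], [2, 3], [0.5, -0.5], [0.0, 0.3])
```

Because convolution is linear in its kernel, `entangled_out` matches `mixture_out` up to floating-point error, while the superimposed variant needs only one forward pass and one copy of the largest weight. The per-candidate mixture here reuses the entangled sub-kernels. A weight-sharing supernet as in Figure 1(b) would instead keep an independent weight for every candidate.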