GradAlign for Training-free Model Performance Inference

Yuxuan Li, Yunhui Guo

TL;DR

GradAlign addresses the challenge of predicting neural network performance at initialization, without training, by quantifying conflicts among per-sample gradients. It theoretically links gradient interference to slower convergence and proposes two metrics, GradAlign-i@ and GradAlign-ii@, based on gradient alignment and the Gram determinant. Empirical results on NAS-BENCH-101, NAS-BENCH-201, and NDS show that the GradAlign variants generally outperform existing training-free NAS baselines in Kendall's $\tau$ and top-architecture selection, with favorable running times. The work also demonstrates that the number of linear regions is not a reliable initialization criterion, motivating gradient-based inference as a more robust alternative.
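
Kendall's $\tau$, the ranking measure mentioned above, compares how a training-free score orders architectures against how their trained accuracies order them. The sketch below is a minimal pure-Python version of the tie-free form (often called tau-a); the function name `kendall_tau` and the toy numbers are illustrative assumptions, not values from the paper, and benchmark code may well use a tie-corrected variant instead.

```python
def kendall_tau(scores, accuracies):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes both lists have the same length (at least 2); tied pairs
    count as neither concordant nor discordant.
    """
    n = len(scores)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical metric scores and accuracies for four architectures.
toy_scores = [1.0, 2.0, 3.0, 4.0]
toy_accuracies = [10.0, 30.0, 20.0, 40.0]  # one swapped pair
tau_example = kendall_tau(toy_scores, toy_accuracies)
```

With one discordant pair out of six, the toy ranking yields a value of 2/3; a perfect ranking gives 1 and a fully reversed one gives -1.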

Abstract

Architecture plays an important role in deciding the performance of deep neural networks. However, the search for the optimal architecture is often hindered by the vast search space, making it a time-intensive process. Recently, a novel approach known as training-free neural architecture search (NAS) has emerged, aiming to discover the ideal architecture without necessitating extensive training. Training-free NAS leverages various indicators for architecture selection, including metrics such as the count of linear regions, the density of per-sample losses, and the stability of the finite-width Neural Tangent Kernel (NTK) matrix. Despite the competitive empirical performance of current training-free NAS techniques, they suffer from certain limitations, including inconsistent performance and a lack of deep understanding. In this paper, we introduce GradAlign, a simple yet effective method designed for inferring model performance without the need for training. At its core, GradAlign quantifies the extent of conflicts within per-sample gradients at initialization, as substantial conflicts hinder model convergence and ultimately result in worse performance. We evaluate GradAlign against established training-free NAS methods using standard NAS benchmarks, showing better overall performance. Moreover, we show that the widely adopted metric of linear region count may not suffice as a dependable criterion for selecting network architectures at initialization.
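
To make "conflicts within per-sample gradients" concrete, the following sketch computes two simple quantities on a toy linear model: the mean pairwise cosine similarity of per-sample gradients, and the determinant of their Gram matrix. It is only an illustration under stated assumptions: the squared-error linear model, the helper names, and the toy data are all hypothetical, and these two quantities are not claimed to be the paper's exact metric definitions.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def per_sample_grads(theta, xs, ys):
    """Gradient of 0.5 * (theta . x - y)^2 w.r.t. theta, one per sample."""
    grads = []
    for x, y in zip(xs, ys):
        residual = dot(theta, x) - y
        grads.append([residual * xi for xi in x])
    return grads


def mean_pairwise_cosine(grads):
    """Average cosine over distinct pairs (assumes nonzero gradients).

    Values near -1 indicate strongly conflicting gradients.
    """
    total, pairs = 0.0, 0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            norm = math.sqrt(dot(grads[i], grads[i]) * dot(grads[j], grads[j]))
            total += dot(grads[i], grads[j]) / norm
            pairs += 1
    return total / pairs


def gram_determinant(grads):
    """det of G[i][j] = g_i . g_j, via elimination with partial pivoting."""
    n = len(grads)
    m = [[dot(gi, gj) for gj in grads] for gi in grads]
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            return 0.0  # numerically singular: gradients are linearly dependent
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return det


theta0 = [0.0, 0.0]  # hypothetical initial parameters
# Two samples that pull theta in opposite directions.
conflicting = per_sample_grads(theta0, [[1.0, 0.0], [1.0, 0.0]], [1.0, -1.0])
# Two samples whose gradients are orthogonal.
orthogonal = per_sample_grads(theta0, [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

In this toy setup the conflicting pair has cosine -1 and a zero Gram determinant, while the orthogonal pair has cosine 0 and a Gram determinant of 1. For real networks, per-sample gradients would come from an autodiff framework rather than a closed form.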

Paper Structure

This paper contains 19 sections, 1 theorem, 8 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Assume that $L$ is a differentiable function whose gradient is $M$-Lipschitz ($M > 0$), that the learning rate $\lambda$ satisfies $\lambda \le \frac{1}{M}$, and that the per-sample gradient norm is bounded by $\sqrt{G}$. $\theta$ denotes the current model parameters and $\theta^+$
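
As background for these assumptions, the standard descent lemma for functions with $M$-Lipschitz gradients already suggests why conflicting per-sample gradients slow progress. The sketch below assumes a full-batch gradient step $\theta^+ = \theta - \lambda \nabla L(\theta)$ and an average loss $L = \frac{1}{n}\sum_{i=1}^{n} L_i$ with $g_i = \nabla L_i(\theta)$; this notation is introduced here for illustration and is not taken from the theorem's own wording.

$$
L(\theta^+) \;\le\; L(\theta) - \lambda\left(1 - \frac{M\lambda}{2}\right)\|\nabla L(\theta)\|^2 \;\le\; L(\theta) - \frac{\lambda}{2}\,\|\nabla L(\theta)\|^2
$$

$$
\|\nabla L(\theta)\|^2 \;=\; \frac{1}{n^2}\left(\sum_{i=1}^{n}\|g_i\|^2 \;+\; \sum_{i \ne j}\langle g_i, g_j\rangle\right)
$$

The second inequality in the first line uses $\lambda \le \frac{1}{M}$. Negative inner products $\langle g_i, g_j\rangle$ shrink the guaranteed decrease, while the bound $\|g_i\| \le \sqrt{G}$ caps the first sum at $nG$.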

Figures (4)

  • Figure 1: The network on the left is preferable, as the model can achieve faster convergence compared to the network on the right. The dotted lines are per-sample losses and gradients. The red lines are average losses and gradients. $\theta_0$ and $\theta_1$ are the initial parameters and the parameters after one-step gradient descent. For the left network, the updated parameters $\theta_1$ are closer to the optimal parameters $\theta^*$ in comparison to the network on the right.
  • Figure 2: Visualization of model testing accuracy versus GradAlign-i@ metric score on CIFAR10, CIFAR100, and ImageNet16-120.
  • Figure 3: Visualization of model testing accuracy versus GradAlign-ii@ metric score on CIFAR10, CIFAR100, and ImageNet16-120.
  • Figure 4: The number of linear regions is sensitive to the parameter values. By slightly perturbing the value of one parameter (marked in red), the number of linear regions greatly increases.
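
The sensitivity described in the Figure 4 caption can be reproduced on a toy example. The sketch below counts activation-pattern changes of a small two-layer ReLU network along the interval [0, 1], treating each distinct on/off pattern as one linear region. The network, its thresholds, and the function name `count_linear_regions` are hypothetical choices for illustration and are not the network shown in the figure.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def count_linear_regions(b, grid=10000):
    """Count activation patterns of a toy ReLU net for x on a grid over [0, 1].

    Hypothetical network: one first-layer unit h = relu(x + b), followed by
    nine second-layer units relu(h - 0.01 * k) for k = 1..9. Only the
    first-layer bias b is varied.
    """
    thresholds = [0.01 * k for k in range(1, 10)]
    regions = 0
    previous = None
    for i in range(grid + 1):
        x = i / grid
        pre = x + b
        h = relu(pre)
        # On/off state of every unit at this input.
        pattern = (pre > 0.0,) + tuple(h - t > 0.0 for t in thresholds)
        if pattern != previous:
            regions += 1
            previous = pattern
    return regions


regions_before = count_linear_regions(b=-1.0)  # first unit never activates
regions_after = count_linear_regions(b=-0.9)   # first unit activates on (0.9, 1]
```

Changing the single bias from -1.0 to -0.9 moves the count from 1 region to 11, because the first-layer unit switches on and its output then crosses every second-layer threshold.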

Theorems & Definitions (1)

  • Theorem 1