When Training-Free NAS Meets Vision Transformer: A Neural Tangent Kernel Perspective

Qiqi Zhou, Yichen Zhu

TL;DR

The paper shows that standard NTK-based training-free NAS metrics fail to predict Vision Transformer performance due to ViT's reliance on high-frequency features. It provides a theoretical bound (Theorem 1) indicating NTK mainly captures low-frequency learning and introduces ViNTK by combining NTK with Fourier features to capture high-frequency content. Empirically, ViNTK yields dramatically faster NAS (on the order of 27–30x speedups in key search spaces) while maintaining or improving accuracy in image classification on ImageNet-1K and semantic segmentation on Cityscapes and ADE20K. This approach enables resource-efficient, scalable NAS for ViT architectures and demonstrates practical benefits over prior training-free NAS methods.

Abstract

This paper investigates using the Neural Tangent Kernel (NTK) to search vision transformers (ViTs) without training. In contrast with the previous observation that NTK-based metrics can effectively predict CNN performance at initialization, we empirically show their inefficacy in the ViT search space. We hypothesize that the fundamental feature learning preference within ViT contributes to the ineffectiveness of applying NTK to NAS for ViT. We both theoretically and empirically validate that NTK essentially estimates the ability of neural networks to learn low-frequency signals, completely ignoring the impact of high-frequency signals in feature learning. To address this limitation, we propose a new method called ViNTK that generalizes the standard NTK to the high-frequency domain by integrating the Fourier features from inputs. Experiments with multiple ViT search spaces on image classification and semantic segmentation tasks show that our method can significantly reduce search costs over prior state-of-the-art NAS for ViT while maintaining similar performance on searched architectures.
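
To make the idea of "integrating Fourier features from inputs" concrete, the toy sketch below may help. It is not the paper's ViNTK definition, which is not given on this page. It assumes a linear model f(x) = w · φ(x) on scalar inputs, for which the NTK is simply φ(x) · φ(x'), and uses a hypothetical `fourier_features` mapping with hand-picked frequencies. It only illustrates that adding a high-frequency component lets the kernel separate two nearby inputs that a low-frequency kernel treats as almost identical.

```python
import math


def fourier_features(x, freqs):
    """Map a scalar x to [cos(2*pi*f*x), sin(2*pi*f*x)] for each f in freqs.

    Hypothetical helper for illustration; the frequencies are an assumption.
    """
    feats = []
    for f in freqs:
        feats.append(math.cos(2 * math.pi * f * x))
        feats.append(math.sin(2 * math.pi * f * x))
    return feats


def linear_ntk(phi_a, phi_b):
    """NTK of a linear model f(x) = w . phi(x).

    The gradient of f with respect to w is phi(x), so the kernel entry
    is the inner product phi(a) . phi(b).
    """
    return sum(a * b for a, b in zip(phi_a, phi_b))


def kernel_matrix(xs, freqs):
    """Kernel matrix over inputs xs under the given Fourier frequencies."""
    feats = [fourier_features(x, freqs) for x in xs]
    return [[linear_ntk(fa, fb) for fb in feats] for fa in feats]


if __name__ == "__main__":
    xs = [0.0, 0.0625]  # two nearby inputs
    k_low = kernel_matrix(xs, freqs=[1.0])         # low frequency only
    k_mix = kernel_matrix(xs, freqs=[1.0, 8.0])    # plus one high frequency
    print("low-frequency kernel, off-diagonal:", k_low[0][1])
    print("mixed-frequency kernel, off-diagonal:", k_mix[0][1])
```

With the low frequency alone, the off-diagonal entry stays close to the diagonal value, so the two inputs look nearly the same to the kernel. Adding the higher frequency pulls that entry far from the diagonal, which is the intuition behind extending a kernel toward high-frequency content.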

Paper Structure (10 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The correlation analysis of NTK-based metrics on CNN and ViT search spaces. Left: Kendall-Tau value on NAS-Bench-201 (Dong and Yang, 2020). Right: Kendall-Tau value on two ViT search spaces.
  • Figure 2: The Kendall-Tau correlation for MSAs only in two ViT search spaces. The value of $\tau$ clearly improves when only MSAs are involved in the search space.
  • Figure 3: The correlation of our proposed ViNTK.
  • Figure 4: Experiments on semantic segmentation.
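
The figures above report Kendall-Tau values, which measure how well a training-free score ranks architectures relative to their trained accuracy. A minimal sketch of that computation follows, assuming the tau-a variant (ties count as neither concordant nor discordant). Which variant the paper uses is not stated here, and the scores and accuracies in the usage example are made-up numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores, accuracies):
    """Kendall tau-a between a proxy score and ground-truth accuracy.

    Returns (concordant - discordant) / total_pairs, a value in [-1, 1].
    Tied pairs contribute to neither count (an assumption of this sketch).
    """
    if len(scores) != len(accuracies):
        raise ValueError("scores and accuracies must have the same length")
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two architectures to rank")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both orderings agree; negative: they disagree.
        agreement = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Hypothetical proxy scores and accuracies for four architectures.
    proxy_scores = [0.1, 0.4, 0.3, 0.9]
    trained_accuracy = [60.0, 70.0, 72.0, 80.0]
    print(kendall_tau(proxy_scores, trained_accuracy))
```

A value near 1 means the proxy ranks architectures almost exactly as training would, a value near 0 means the ranking is uninformative, and a negative value means the proxy tends to invert the true order.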