ZeroLM: Data-Free Transformer Architecture Search for Language Models

Zhen-Song Chen, Hong-Wei Ding, Xian-Jia Wang, Witold Pedrycz

TL;DR

ZeroLM presents a data-free Transformer NAS proxy that estimates model capacity from weight statistics and decouples Attention from FFN blocks to improve ranking accuracy. By defining a simple SVD-based module capacity measure and aggregating it with a tunable balance parameter $\alpha$, the method ranks architectures without data and with minimal computation. Two lightweight strategies determine $\alpha$: benchmark sampling and a heuristic correlation approach, enabling task adaptation without large datasets. Empirical results across FlexiBERT, GPT-2, and LoNAS demonstrate strong rank correlations (e.g., Spearman's $\rho \approx 0.76$, Kendall's $\tau \approx 0.53$ on FlexiBERT) and substantial efficiency gains, suggesting practical utility for large-scale Transformer NAS and pruning.
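The decoupled scoring idea described above can be illustrated with a minimal sketch. The summary does not give the exact capacity formula, so everything concrete here is an assumption: `module_capacity` (sum of `log(1 + singular value)`), the convex combination in `proxy_score`, the helper `random_arch`, and the choice to rank by descending score are hypothetical stand-ins, not the authors' definitions. Only the overall shape (an SVD-based per-module measure, Attention and FFN scored separately, combined through $\alpha$, no input data) comes from the text.

```python
import numpy as np


def module_capacity(weight):
    """Hypothetical capacity measure: sum of log(1 + singular value).

    The paper's actual SVD-based measure is not given in this summary.
    """
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(np.sum(np.log1p(singular_values)))


def proxy_score(attn_weights, ffn_weights, alpha):
    """Assumed aggregation: alpha weights Attention, (1 - alpha) weights FFN."""
    attn = sum(module_capacity(w) for w in attn_weights)
    ffn = sum(module_capacity(w) for w in ffn_weights)
    return alpha * attn + (1.0 - alpha) * ffn


def random_arch(rng, hidden, ffn_dim, layers):
    """Hypothetical helper: randomly initialised weights, no training data."""
    # Four square projections per layer (query, key, value, output).
    attn = [rng.standard_normal((hidden, hidden)) for _ in range(4 * layers)]
    ffn = []
    for _ in range(layers):
        ffn.append(rng.standard_normal((hidden, ffn_dim)))  # up-projection
        ffn.append(rng.standard_normal((ffn_dim, hidden)))  # down-projection
    return attn, ffn


rng = np.random.default_rng(0)
candidates = {
    "small": random_arch(rng, hidden=16, ffn_dim=32, layers=2),
    "large": random_arch(rng, hidden=32, ffn_dim=64, layers=4),
}
scores = {name: proxy_score(a, f, alpha=0.5) for name, (a, f) in candidates.items()}

# Ranking by descending score is an assumption of this sketch.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Setting `alpha` to 1.0 or 0.0 reduces the score to the Attention-only or FFN-only term, which is one way to inspect how much each module type contributes to a ranking.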

Abstract

Neural architecture search (NAS) provides a systematic framework for automating the design of neural network architectures, yet its widespread adoption is hindered by prohibitive computational requirements. Existing zero-cost proxy methods, while reducing search overhead, demonstrate inadequate performance in architecture ranking tasks, particularly for Transformer-based models where they often underperform simple parameter counting metrics. Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity. This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics computation while decomposing Transformer architectures into functionally distinct sub-modules, thereby optimizing the balance of their contributions to overall performance. Our comprehensive evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark. The proposed method exhibits exceptional computational efficiency while maintaining robust performance across diverse NAS benchmark tasks, offering a practical solution for large-scale architecture search.
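For readers unfamiliar with the two ranking metrics quoted above, the following sketch computes Spearman's rho and Kendall's tau between proxy scores and measured performance. The implementation is a plain-Python illustration (Kendall's tau in its tau-a form, which assumes no ties), and the `proxy_scores` and `accuracies` lists are made-up values for demonstration only, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; assumes no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up example: five architectures, one pair ranked in the wrong order.
proxy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
accuracies = [0.60, 0.65, 0.63, 0.70, 0.72]

rho = spearman_rho(proxy_scores, accuracies)
tau = kendall_tau(proxy_scores, accuracies)
print(rho, tau)
```

Both metrics equal 1 when the proxy orders architectures exactly as their measured performance does. Kendall's tau counts pairwise order agreements, so it is usually smaller in magnitude than Spearman's rho on the same data, which is consistent with the 0.76 versus 0.53 pattern reported for FlexiBERT.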

Paper Structure

This paper contains 25 sections, 12 equations, 17 figures, 12 tables, 2 algorithms.

Figures (17)

  • Figure 1: Traditional ZCP vs. ZeroLM in terms of ranking ability, data bias and time cost.
  • Figure 2: Computing Flow of Proxy Metric.
  • Figure 3: Comparison of Log Complexity and #params on 500 architectures randomly sampled from the FlexiBERT benchmark. Log Complexity correlates negatively with real performance, and its ranking performance is similar to that of #params. This shows that a norm-based measure is effective but not accurate enough.
  • Figure 4: Change in correlation when openwebtext and wikitext-103 are used as input data, for 500 architectures randomly sampled from the FlexiBERT benchmark.
  • Figure 5: Comparison of FFN-only, Attn-only, Both-only and #params on 500 architectures randomly sampled from the FlexiBERT benchmark.
  • ...and 12 more figures