W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

Shang Wang

TL;DR

Weight-weighted PCA (W-PCA) is a gradient-free zero-shot NAS proxy for lightweight language models. It combines the parameter count with the PCA-based information content of the FFN layers to rank architectures without training. Using a genetic algorithm (GA) over a search space designed for NLU tasks, followed by fine-tuning with knowledge distillation (KD), W-PCA makes the search two to three orders of magnitude faster and reaches higher GLUE/SQuAD performance than prior NAS methods. The study reports strong ranking correlations and competitive accuracy, with ablations confirming the advantage of the W-PCA product over the individual proxies. The work has practical implications for the efficient deployment of lightweight LMs and points to possible extensions to larger generative models.

Abstract

The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding $\eta$ in the feed-forward network (FFN) layer. Additionally, by eliminating the need for gradient computations, we reduce the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.
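
The two proxies named in the abstract can be illustrated with a minimal sketch. It assumes that the PCA proxy of a layer is the smallest number of principal components whose cumulative explained-variance ratio reaches $\eta$, and that the per-layer values are summed before being multiplied by the parameter count. The function names `pca_dim` and `w_pca`, the summation over layers, and the use of "at least $\eta$" rather than "strictly above $\eta$" are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np


def pca_dim(hidden, eta=0.99):
    """Smallest k whose top-k principal components explain >= eta of the variance.

    hidden: array of shape (num_tokens, hidden_dim), e.g. FFN hidden states
    collected on one mini-batch (hypothetical input for this sketch).
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to the
    # eigenvalues of its covariance matrix, and are returned in descending order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0:
        return 0  # constant input carries no variance
    cumulative_ratio = np.cumsum(variances) / total
    k = int(np.searchsorted(cumulative_ratio, eta)) + 1
    return min(k, len(variances))  # guard against rounding when eta is close to 1


def w_pca(num_params, layer_hiddens, eta=0.99):
    """Assumed form of the proxy: parameter count times the summed per-layer PCA values."""
    return num_params * sum(pca_dim(h, eta) for h in layer_hiddens)
```

Only a forward pass is needed to collect the hidden states, so no gradients are computed at any point, which is the property the abstract emphasizes.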

Paper Structure

This paper contains 42 sections, 14 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Comparison of the running time of W-PCA and training-based NAS methods for lightweight language models. Our method reduces the time needed to search for the optimal network structure by two to three orders of magnitude, because it does not need to train a supernet.
  • Figure 2: Plots depicting the evaluation of zero-shot proxy metrics on 500 randomly sampled architectures from the FlexiBERT search space. Following Serianni and Kalita (2023), we use the GLUE score of each neural network as the ground truth and evaluate each zero-shot proxy metric by its ranking correlation with the ground truth. The calculation of PCA is described in Section \ref{subsec:v_pca}, and the zero-shot proxies used for the comparison are summarized in Section \ref{subsec:zero_shot}. Our metric W-PCA is calculated as the product of the number of parameters (#params) and the PCA proxy value (PCA).
  • Figure 3: (a) and (b) show the PCA score curves for BERT (Devlin et al., 2019) and MobileBERT (Sun et al., 2020), respectively, at different epochs during training ($\eta=0.99$). (c) presents the progression of GLUE scores for BERT and MobileBERT over training epochs.
  • Figure 4: Overview of the W-PCA framework for NLU tasks. The search space consists of $m$ layers, each with 2 candidate blocks and $n$ candidate dimensions, resulting in a total of $(2\times n)^m$ combinations. A genetic algorithm (detailed parameterization provided in Section \ref{subsubsec:search_space}) is employed to identify the optimal structure with the highest W-PCA value. This structure is subsequently refined through additional training using knowledge distillation (KD). In the figure, FFN and MHA represent the feed-forward network and multi-head attention, respectively.
  • Figure 5: Evaluation of zero-shot metrics with various initialization weights in the FlexiBERT search space. Ten architectures are randomly sampled from the search space, representing decile ranges of the GLUE score (e.g., 0-10%, 10-20%, ..., 90-100%). To ensure robustness, ten different random seeds are employed for weight initialization.
  • ...and 4 more figures
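
The ranking evaluation described for Figure 2 compares the ordering induced by a proxy with the ordering induced by the GLUE ground truth. The sketch below uses Kendall's tau (the tau-a variant, which does not correct for ties) as one common ranking correlation; this page does not state which coefficient the paper reports, so the choice and the function name `kendall_tau` are assumptions for illustration.

```python
from itertools import combinations


def kendall_tau(proxy_scores, ground_truth):
    """Kendall's tau-a between two equally long score lists (illustrative choice)."""
    if len(proxy_scores) != len(ground_truth):
        raise ValueError("both score lists must have the same length")
    n = len(proxy_scores)
    if n < 2:
        raise ValueError("at least two architectures are needed")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both scores order the two architectures the same way.
        sign = (proxy_scores[i] - proxy_scores[j]) * (ground_truth[i] - ground_truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value of 1 means the proxy ranks every pair of architectures in the same order as the ground truth, and a value of -1 means every pair is reversed. For larger samples, `scipy.stats.kendalltau` and `scipy.stats.spearmanr` provide tie-aware alternatives.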