Elastic Architecture Search for Efficient Language Models

Shang Wang

TL;DR

Large pretrained language models demand substantial compute and energy, motivating compact and efficient architectures. The paper proposes the Elastic Language Model (ELM), a neural architecture search framework with a flexible search space (BERT- and MobileBERT-based blocks), dynamic dimension and head search guided by PCA and CKA, and relational knowledge distillation to preserve block diversity. In experiments on masked and causal language modeling, ELM outperforms existing lightweight NAS methods and achieves competitive or superior results with far fewer parameters and lower latency. This approach offers a practical path to deploying efficient language models at scale while maintaining strong performance.

Abstract

As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: The diagram depicts the expansion of the hidden dimension within the feed-forward network (FFN), using five blocks as an example. Initially, the dimension of each block is set to 132. After each epoch, the dimensions of the top two blocks with the highest PCA values are increased by 132. It is crucial to note that the dimensions corresponding to fixed PCA values also change after their increase, which is indicated by "?" in the diagram.
  • Figure 2: Curves of PCA scores of (a) BERT (Devlin et al., 2019) and (b) MobileBERT (Sun et al., 2020) at different epochs during training.
  • Figure 3: (a) CKA comparison among different heads of layer 6 in the searched architecture. (b) Average cosine similarity between features of blocks trained with MSE or RKD in each layer.
  • Figure 4: Comparison of GPT2-Base and our Chat-ELM-Small trained with MiniLLM on a chat task.
  • Figure 5: The individual architectures of ELM-Small/Tiny/Micro are displayed from top to bottom. H represents the number of heads and d represents the hidden dimensions, while QV/KV/no share indicate the different ways of weight sharing.
  • ...and 1 more figure
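
The expansion rule described in the Figure 1 caption (every block starts at 132, and after each epoch the two blocks with the highest PCA values grow by 132) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_ffn_dims`, its parameters, and the score values are hypothetical, and the PCA score itself is treated as an input, since the summary above does not say how it is computed.

```python
from typing import List, Sequence


def expand_ffn_dims(
    dims: Sequence[int],
    pca_scores: Sequence[float],
    step: int = 132,   # growth increment taken from the Figure 1 caption
    top_k: int = 2,    # number of blocks grown per epoch, per the caption
) -> List[int]:
    """Return the per-block FFN dimensions after one expansion step.

    Hypothetical helper: `pca_scores` is assumed to hold one score per
    block, computed elsewhere by whatever PCA criterion the paper uses.
    """
    if len(dims) != len(pca_scores):
        raise ValueError("dims and pca_scores must have the same length")

    # Rank block indices by score, highest first. Python's sort is stable
    # even with reverse=True, so ties keep the lower-indexed block first.
    ranked = sorted(range(len(dims)), key=lambda i: pca_scores[i], reverse=True)
    winners = set(ranked[:top_k])

    # Grow only the selected blocks; leave the rest unchanged.
    return [d + step if i in winners else d for i, d in enumerate(dims)]


# Five blocks, as in the figure, each starting at dimension 132.
initial_dims = [132] * 5
example_scores = [0.2, 0.9, 0.4, 0.7, 0.1]  # made-up values for illustration
expanded = expand_ffn_dims(initial_dims, example_scores)
print(expanded)
```

In a real search loop the scores would have to be recomputed after every epoch rather than reused, because, as the caption notes, the values change once a block has been enlarged.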