Low-Rank Adapters Meet Neural Architecture Search for LLM Compression

J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain

TL;DR

The paper tackles the resource burden of fine-tuning and deploying large language models by marrying parameter-efficient low-rank adapters with neural architecture search (NAS) using weight-sharing super-networks. It surveys elastic LoRA adapters (Mode A and Mode B) and NAS-guided approaches (LoNAS), along with extensions (Shears, SQFT) that handle sparsity and low-precision constraints. Empirical results show that elastic adapters can guide NAS to smaller, faster sub-architectures with minimal accuracy loss, achieving up to around 80% parameter reduction and up to $1.4\times$ inference speedups, at the cost of increased fine-tuning overhead for some methods. The work also introduces strategies like SparsePEFT and QA-SparsePEFT to maintain alignment when merging adapters with sparse and quantized bases, broadening the applicability of PEFT in constrained environments and providing code resources at the IntelLabs GitHub repository.

Abstract

The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.

Paper Structure

This paper contains 13 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Vanilla LoRA Adapter and two different modes of the elastic adapter. Mode A allows only the LoRA rank to be elastic, while Mode B also enables the input or output channels to be elastic.
  • Figure 2: Elastic adapters guide the removal of elements in the frozen model weights, resulting in smaller, high-performing models. This process exemplifies the application of Mode B as depicted in Figure 1.
  • Figure 3: Elastic low-rank adapters for fine-tuning sparse efficient models. This style exemplifies the application of Mode A as depicted in Figure 1.
  • Figure 4: Search progression to discover Pareto-optimal low-rank adapter configurations. The horizontal line represents the zero-shot accuracy of the midpoint heuristic sub-adapter.
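
The difference between the two elastic-adapter modes in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ElasticLoRA`, the method `delta`, the random initialization, and the choice to activate a sub-adapter by slicing the leading rows and columns of the shared matrices are all assumptions made for illustration. Mode A varies only the rank; Mode B additionally varies the input or output channels.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class ElasticLoRA:
    """Hypothetical elastic adapter: delta_W = B @ A with shared weights.

    A has shape (max_rank, in_features); B has shape (out_features, max_rank).
    Sub-adapters reuse the leading slices of A and B (an assumed
    weight-sharing scheme, chosen here only to keep the sketch short).
    """

    def __init__(self, in_features, out_features, max_rank, seed=0):
        rng = random.Random(seed)
        # Random values for both matrices so the sketch gives non-zero output;
        # standard LoRA initializes B to zero instead.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(in_features)]
                  for _ in range(max_rank)]
        self.B = [[rng.gauss(0.0, 0.02) for _ in range(max_rank)]
                  for _ in range(out_features)]

    def delta(self, rank, out_features=None, in_features=None):
        """Return the weight update of one sub-adapter (rank >= 1).

        Mode A: pass only `rank`.
        Mode B: also pass a reduced `out_features` and/or `in_features`.
        """
        out_f = len(self.B) if out_features is None else out_features
        in_f = len(self.A[0]) if in_features is None else in_features
        a_sub = [row[:in_f] for row in self.A[:rank]]
        b_sub = [row[:rank] for row in self.B[:out_f]]
        return matmul(b_sub, a_sub)


adapter = ElasticLoRA(in_features=8, out_features=6, max_rank=4)

# Mode A: elastic rank only, so the update keeps the full 6 x 8 shape.
mode_a = adapter.delta(rank=2)

# Mode B: elastic rank and output channels, so the update shrinks to 3 x 8.
mode_b = adapter.delta(rank=2, out_features=3)

print(len(mode_a), len(mode_a[0]))
print(len(mode_b), len(mode_b[0]))
```

Because a Mode B sub-adapter produces a smaller update, it can only be merged with a correspondingly reduced slice of the frozen weights, which is the sense in which Figure 2 describes adapters guiding the removal of elements from the base model.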