Low-Rank Adapters Meet Neural Architecture Search for LLM Compression

J. Pablo Muñoz; Jinjie Yuan; Nilesh Jain

TL;DR

The paper addresses the resource cost of fine-tuning and deploying large language models by combining parameter-efficient low-rank adapters with neural architecture search (NAS) over weight-sharing super-networks. It surveys elastic LoRA adapters (Mode A and Mode B) and NAS-guided approaches (LoNAS), along with extensions (Shears, SQFT) that handle sparsity and low-precision constraints. The reported results show that elastic adapters can guide NAS toward smaller, faster sub-architectures with minimal accuracy loss, reaching up to roughly 80% parameter reduction and up to $1.4\times$ inference speedup, at the cost of increased fine-tuning overhead for some methods. The work also introduces strategies such as SparsePEFT and QA-SparsePEFT, which keep adapters aligned with sparse and quantized base weights so that the two can be merged, broadening the applicability of PEFT in constrained environments. Code is available in the IntelLabs GitHub repository.
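To make the two central ideas concrete, the following is a minimal sketch of an elastic low-rank adapter and a sparsity-preserving merge. It is illustrative only and is not taken from the authors' code: the function names, the `scale` parameter, the rank choices, and the masking rule (keep the update only where the base weight is nonzero) are all assumptions, and matrices are plain Python lists so the example needs no third-party packages.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def elastic_lora_delta(B, A, rank, scale=1.0):
    """Return the low-rank update built from the first `rank` components.

    B has shape (d_out, r_max) and A has shape (r_max, d_in). Every smaller
    rank reuses a slice of the same two matrices, which is the weight-sharing
    idea behind a super-network. (Hypothetical helper, not the paper's API.)
    """
    r_max = len(A)
    if not 1 <= rank <= r_max:
        raise ValueError(f"rank must be in [1, {r_max}], got {rank}")
    B_sub = [row[:rank] for row in B]  # keep the first `rank` columns of B
    A_sub = A[:rank]                   # keep the first `rank` rows of A
    return [[scale * v for v in row] for row in matmul(B_sub, A_sub)]


def merge_with_sparse_base(W, delta):
    """Add `delta` to `W` only where `W` is nonzero.

    Assumed masking rule: pruned (zero) entries of the base stay zero, so the
    merged matrix keeps the base sparsity pattern.
    """
    return [
        [w + d if w != 0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Usage: sample one sub-adapter rank, as a search or training step might do.
random.seed(0)
d_out, d_in, r_max = 4, 3, 8
B = [[random.gauss(0.0, 0.1) for _ in range(r_max)] for _ in range(d_out)]
A = [[random.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(r_max)]
W = [[0.5, 0.0, -0.2], [0.0, 0.0, 0.3], [0.1, -0.4, 0.0], [0.0, 0.7, 0.0]]

rank = random.choice([2, 4, 8])  # hypothetical elastic rank choices
merged = merge_with_sparse_base(W, elastic_lora_delta(B, A, rank))
```

Slicing a single pair of matrices, rather than storing one adapter per rank, is what lets a search procedure evaluate many sub-adapter configurations without retraining each one. The masked merge shows why alignment matters: without the mask, adding a dense update would fill in the pruned entries of the base.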

Abstract

The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
