LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models

Anthony Sarah; Sharath Nittur Sridhar; Maciej Szankin; Sairam Sundaresan

TL;DR

This work tackles the high resource demands of large language models by applying one-shot Neural Architecture Search (NAS) to LLaMA2-7B to automatically discover Pareto-optimal sub-networks that balance model size and accuracy. The approach fine-tunes a super-network once using InstaTune and then runs a LINAS-based multi-objective search, yielding smaller, faster architectures with negligible accuracy loss on certain tasks (up to a 1.5x reduction in model size and a 1.3x throughput speedup). The method compares favorably with certain pruning and sparsification techniques and is complementary to standard INT8 quantization, enabling deployment on less capable hardware without specialized kernels. The results reveal task-specific architectural patterns in layer counts and intermediate sizes, and show that automatic compression can maintain or improve performance on ARC, MMLU, TruthfulQA MC1, and WinoGrande. This makes LLM capabilities more accessible on cost-constrained hardware, since the network is fine-tuned only once.

Abstract

The abilities of modern large language models (LLMs) in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate that quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs that can be used on less expensive and more readily available hardware platforms.
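
To make the notion of a Pareto-optimal architecture concrete, the following is a minimal sketch of how a Pareto front can be extracted from a set of evaluated sub-networks in a model size / accuracy objective space. It is not the authors' search procedure: the `Candidate` type, the function names, and all numeric values are hypothetical and chosen only for illustration. A sub-network is kept if no other candidate is at least as small and at least as accurate while being strictly better on one of the two objectives.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one evaluated sub-network."""
    name: str
    size_gb: float   # objective to minimize
    accuracy: float  # objective to maximize


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.size_gb <= b.size_gb and a.accuracy >= b.accuracy
    strictly_better = a.size_gb < b.size_gb or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the non-dominated candidates, sorted by increasing size."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    return sorted(front, key=lambda c: c.size_gb)


# Made-up values for illustration only; these are not results from the paper.
candidates = [
    Candidate("full", 12.0, 0.45),
    Candidate("sub_a", 9.0, 0.46),
    Candidate("sub_b", 8.0, 0.44),
    Candidate("sub_c", 10.0, 0.43),
    Candidate("sub_d", 7.0, 0.40),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data the full-size network is dominated by a smaller candidate with higher accuracy, which mirrors the abstract's observation that the pre-trained network can be unnecessarily large for certain tasks. The quadratic comparison is adequate for small candidate sets; a search method only needs this filter applied to the architectures it has already evaluated.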

Paper Structure (24 sections, 9 figures, 5 tables)

Introduction
Methods
Search Method
Search Space
Evaluation
Hyper-Parameters
Tasks
AI2 Reasoning Challenge
Massive Multitask Language Understanding
TruthfulQA
WinoGrande
Results
Search Analysis
AI2 Reasoning Challenge
Massive Multitask Language Understanding
...and 9 more sections

Figures (9)

Figure 1: Pareto fronts after applying our method to search for optimal sub-network architectures in the model size / ARC-c accuracy (left) and model size / ARC-e accuracy (right) objective spaces. The red dot indicates the model size and accuracy of the pre-trained LLaMA2-7B network from Touvron et al. (2023).
Figure 2: Pareto fronts after applying our method to search for optimal sub-networks with the MMLU task. The left Pareto front is in the model size / MMLU accuracy objective space, while the right Pareto front is in the throughput / MMLU accuracy objective space. Throughput is evaluated using a single NVIDIA TitanV GPU. The red dot indicates the model size and accuracy of the pre-trained LLaMA2-7B network from Touvron et al. (2023).
Figure 3: Pareto front after applying our method to Alpaca-fine-tuned LLaMA2-7B in the model size / TruthfulQA MC1 accuracy objective space. The red dot indicates the pre-trained LLaMA2-7B network using the weights from https://huggingface.co/meta-llama/Llama-2-7b.
Figure 4: Pareto front after applying our method to Alpaca-fine-tuned LLaMA2-7B in the model size / WinoGrande accuracy objective space. The red dot indicates the model size and accuracy of the pre-trained LLaMA2-7B network from Touvron et al. (2023).
Figure 5: Pareto fronts before and after applying INT8 quantization to Alpaca-fine-tuned LLaMA2-7B in the model size / accuracy objective spaces. The blue lines are the quantized (INT8) Pareto fronts, while the green lines are the original non-quantized (FP16) Pareto fronts also shown in Figures 1 through 4. The red dots indicate the model size and accuracy of the pre-trained, non-quantized LLaMA2-7B network from Touvron et al. (2023).
...and 4 more figures
