LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan
TL;DR
This work tackles the high resource demands of large language models by applying one-shot Neural Architecture Search to LLaMA2-7B to automatically discover Pareto-optimal sub-networks that balance model size and accuracy. By fine-tuning a super-network derived from InstaTune and running a LINAS-based multi-objective search, the approach yields smaller, faster architectures with negligible accuracy loss on several benchmark tasks. The method outperforms certain pruning and sparsification techniques and is complementary to standard INT8 quantization, enabling deployment on less capable hardware without specialized kernels. The results reveal task-specific architectural patterns in layer counts and intermediate sizes, and show practical, automatic compression that maintains or improves performance on ARC, MMLU, TruthfulQA MC1, and WinoGrande. This work makes LLM capabilities more accessible on cost-constrained hardware with minimal retraining.
Abstract
The abilities of modern large language models (LLMs) in solving natural language processing, complex reasoning, sentiment analysis, and other tasks have been extraordinary, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate that quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs that can be used on less expensive and more readily available hardware platforms.
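To make the notion of a Pareto-optimal architecture concrete, the following minimal Python sketch filters a set of candidate sub-networks down to those not dominated in both model size and accuracy. It is an illustration only, not the search procedure used in this work: the function name `pareto_front`, the candidate names, and all numeric values are hypothetical and are not results from the paper.

```python
from typing import List, Tuple

# (name, model size in billions of parameters, task accuracy)
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates not dominated on (smaller size, higher accuracy)."""
    front = []
    for name, size, acc in candidates:
        # A candidate is dominated if another one is at least as good on
        # both objectives and strictly better on at least one of them.
        dominated = any(
            (s <= size and a >= acc) and (s < size or a > acc)
            for _, s, a in candidates
        )
        if not dominated:
            front.append((name, size, acc))
    # Sort by size so the trade-off curve reads from smallest to largest.
    return sorted(front, key=lambda c: c[1])


# Hypothetical values for illustration only (not taken from the paper).
candidates = [
    ("full", 6.7, 0.50),
    ("sub_a", 4.5, 0.50),
    ("sub_b", 4.0, 0.47),
    ("sub_c", 5.0, 0.48),
    ("sub_d", 3.5, 0.40),
]

front = pareto_front(candidates)
for name, size, acc in front:
    print(f"{name}: {size:.1f}B params, accuracy {acc:.2f}")
```

In this toy example the full-size network is dominated by `sub_a`, which matches its accuracy at a smaller size. That is the kind of situation the abstract describes when it says the pre-trained network is unnecessarily large for certain tasks.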
