EfficientLLM: Efficiency in Large Language Models

Zhengqing Yuan, Weixiang Sun, Yixin Liu, Huichi Zhou, Rong Zhou, Yiyang Li, Zheyuan Zhang, Wei Song, Yue Huang, Haolong Jia, Keerthiram Murugesan, Yu Wang, Lifang He, Jianfeng Gao, Lichao Sun, Yanfang Ye

TL;DR

EfficientLLM presents the first large-scale, empirical benchmark to evaluate efficiency techniques for LLMs across architecture pretraining, fine-tuning, and inference on modern GPU clusters. The study systematically compares efficient attention variants, sparse MoE, PEFT methods, and quantization, introducing fine-grained metrics that capture memory, compute, latency, throughput, energy, and compression. Key findings reveal that no single technique dominates across all axes, that optima are task- and scale-dependent, and that many efficiency methods generalize to vision and multimodal models; int4 quantization offers substantial resource savings with modest accuracy loss. The work provides practical, data-driven guidance for researchers and engineers to navigate the efficiency-performance landscape of next-generation foundation models, and it open-sources datasets, pipelines, and leaderboards to foster reproducibility and broader adoption.

Abstract

Large Language Models (LLMs) have driven significant progress, yet their growing parameter counts and context windows incur prohibitive compute, energy, and monetary costs. We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale. Conducted on a production-class cluster (48xGH200, 8xH200 GPUs), our study systematically explores three key axes: (1) architecture pretraining (efficient attention variants: MQA, GQA, MLA, NSA; sparse Mixture-of-Experts (MoE)), (2) fine-tuning (parameter-efficient methods: LoRA, RSLoRA, DoRA), and (3) inference (quantization methods: int4, float16). We define six fine-grained metrics (Memory Utilization, Compute Utilization, Latency, Throughput, Energy Consumption, Compression Rate) to capture hardware saturation, latency-throughput balance, and carbon cost. Evaluating over 100 model-technique pairs (0.5B-72B parameters), we derive three core insights: (i) Efficiency involves quantifiable trade-offs: no single method is universally optimal; e.g., MoE reduces FLOPs and improves accuracy but increases VRAM by 40%, while int4 quantization cuts memory/energy by up to 3.9x at a 3-5% accuracy drop. (ii) Optima are task- and scale-dependent: MQA offers optimal memory-latency trade-offs for constrained devices, MLA achieves lowest perplexity for quality-critical tasks, and RSLoRA surpasses LoRA efficiency only beyond 14B parameters. (iii) Techniques generalize across modalities: we extend evaluations to Large Vision Models (Stable Diffusion 3.5, Wan 2.1) and Vision-Language Models (Qwen2.5-VL), confirming effective transferability. By open-sourcing datasets, evaluation pipelines, and leaderboards, EfficientLLM provides essential guidance for researchers and engineers navigating the efficiency-performance landscape of next-generation foundation models.
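To make the memory trade-offs named in the abstract more concrete, the following back-of-envelope sketch estimates KV-cache size for multi-head, grouped-query, and multi-query attention, and weight storage at float16 versus int4. It is not the benchmark's measurement procedure: the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context) and the function names are hypothetical, chosen only for illustration, and the int4 figure ignores the per-group scales and zero-points that real quantization formats store.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one forward context.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Idealized weight storage; ignores quantization metadata (assumption)."""
    return n_params * bits_per_param // 8


# Hypothetical model shape, not taken from the paper.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(n_kv_heads=32, **shape)  # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # all query heads share one K/V head

fp16_weights = weight_bytes(7_000_000_000, 16)
int4_weights = weight_bytes(7_000_000_000, 4)

if __name__ == "__main__":
    for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
        print(f"{name}: {size / 2**20:.0f} MiB of KV cache")
    print(f"float16 / int4 weight ratio: {fp16_weights / int4_weights:.1f}x")
```

Under these assumptions the KV cache shrinks in proportion to the number of key/value heads, which is the mechanism behind the memory advantage the abstract attributes to MQA. The idealized 4x weight ratio is an upper bound; measured savings are lower once quantization metadata and activations are counted.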

Paper Structure

This paper contains 59 sections, 26 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Overview of the EfficientLLM framework.
  • Figure 2: Ranking of LLM training and inference efficiency and performance across various techniques. The chart compares attention mechanisms, MoE designs, and architecture types (top block), parameter-efficient fine-tuning methods (middle block), and quantization strategies (bottom block) across eight dimensions: performance, utilization (AMU, PCU), latency (AL, TT), throughput (ST, IT, TT), energy consumption (AEC), and compression (MCR). For parameter-efficient tuning, "Freeze" refers to the method that freezes the first 8 layers of the model. Methods marked with an asterisk ($^*$), such as "Full$^*$", utilize DeepSpeed ZeRO-3.
  • Figure 3: Efficiency LLM Results. This figure illustrates the performance and efficiency trade-offs of various architectural improvements for LLMs. (a) Radar charts comparing different Efficient Attention Mechanisms (MQA, GQA, MLA, and NSA) across 0.5B, 1.5B, and 3B model parameters, evaluated on Perplexity (PPL), Average Memory Utilization (AMU), Average Latency (AL), Tokens Throughput (TT), and Average Energy Consumption (AEC). (b) Bar chart assessing Efficient Positional Encoding methods (RoPE, Absolute, Learnable Absolute, Relative, and None) for a 1.5B parameter model on the same five key metrics. (c) Bubble chart comparing Dense Models with Mixture-of-Experts (MoE) Models of varying parameter sizes, highlighting differences in PPL, AMU, AL, TT, and AEC. These visualizations correspond to the detailed results presented in Tables 4, 5, and 6. Note: All metrics presented in this figure are normalized.
  • Figure 4: Assessment of training and fine-tuning efficiency across multiple LLMs. (a) Comparison of different fine-tuning methods (LoRA, LoRA-plus, RSLoRA, DoRA, PISSA, Freeze, and full fine-tuning using DeepSpeed) across seven model architectures (Llama-3.2-1B/3B, Llama-3.1-8B, Qwen-2.5-7B/14B, Mistral-Small-24B, and Mistral-7B) using the O1-SFT dataset. Each bar shows the corresponding Efficiency Score (higher is better) and Loss (lower is better). The Efficiency Score is computed as a weighted harmonic combination of normalized resource metrics. Methods marked with * denote full fine-tuning using DeepSpeed.
  • Figure 5: Assessment of quantization-based inference efficiency across model precisions. Radar plots compare normalized efficiency metrics across three quantization formats: bfloat16, float16, and int4. Each plot evaluates models from DeepSeek, Qwen, Phi, and Yi families using six normalized metrics (all $\uparrow$ higher is better): average task performance, inference throughput (IT), average memory utilization (AMU), sum latency (Sum AL), average energy consumption (AEC), and model compression ratio (MCR). All values are normalized as detailed in the section on min-max normalization. While bfloat16 typically yields higher performance scores, int4 excels in throughput, memory, and compression, indicating its efficiency in deployment-constrained environments.
  • ...and 2 more figures
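
The captions for Figures 4 and 5 refer to min-max normalized metrics and to an Efficiency Score built as a weighted harmonic combination of them. The sketch below shows one way such a score could be computed. It is an illustration under stated assumptions, not the authors' formula: the weights, the epsilon floor, the handling of "lower is better" metrics, and the function names are all hypothetical, since this excerpt does not give the exact definition.

```python
from typing import Sequence


def min_max_normalize(values: Sequence[float],
                      higher_is_better: bool = True) -> list[float]:
    """Rescale values to [0, 1] so that 1.0 is always the best entry."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All methods tie on this metric; treat them as equally good.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    # Flip metrics such as latency or energy, where smaller raw values win.
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def weighted_harmonic_mean(scores: Sequence[float],
                           weights: Sequence[float],
                           eps: float = 1e-6) -> float:
    """Weighted harmonic mean; eps (an assumption) avoids division by zero."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(weights) / sum(w / max(s, eps) for s, w in zip(scores, weights))


if __name__ == "__main__":
    # Hypothetical raw measurements for three methods (not from the paper).
    latency_s = [10.0, 20.0, 30.0]         # lower is better
    throughput_tok_s = [300.0, 200.0, 100.0]  # higher is better

    lat = min_max_normalize(latency_s, higher_is_better=False)
    thr = min_max_normalize(throughput_tok_s)

    for i, (l, t) in enumerate(zip(lat, thr)):
        score = weighted_harmonic_mean([l, t], weights=[1.0, 1.0])
        print(f"method {i}: efficiency score = {score:.3f}")
```

A harmonic combination penalizes a method that is very weak on any single metric more strongly than an arithmetic mean would, which is one plausible reason to prefer it for a composite efficiency score.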