A Study of Optimizations for Fine-tuning Large Language Models

Arjun Singh, Nikhil Pandey, Anup Shirgaonkar, Pavan Manoj, Vijay Aski

TL;DR

This work tackles the memory bottlenecks in fine-tuning very large language models by empirically evaluating a set of optimization techniques—Gradient Checkpointing, LoRA, DeepSpeed ZeRO, and FlashAttention—and their interactions. It demonstrates that ZeRO-2 combined with LoRA offers the best default memory-time balance for a range of model sizes, while ZeRO-3 with LoRA and gradient checkpointing is essential for ultra-large models, especially under resource limits. A theoretical memory model is developed to estimate GPU memory usage and guide resource planning, and long-context fine-tuning is shown to benefit significantly from FlashAttention-2 on compatible GPUs. The results provide practical guidance for enterprise fine-tuning, including recommended defaults, strategies for long-context data, and approaches for constrained hardware, with future work pointing to quantization, smaller model regimes, and much longer context lengths.

Abstract

Fine-tuning large language models is a popular choice among users trying to adapt them for specific applications. However, fine-tuning these models is a demanding task because the user has to weigh several factors, such as resource budget, runtime, model size, and context length. A specific challenge is that fine-tuning is memory intensive, which constrains both the hardware memory required and the context length of the training data that can be handled. In this work, we share a detailed study of a variety of fine-tuning optimizations across different fine-tuning scenarios. In particular, we assess Gradient Checkpointing, Low-Rank Adaptation, DeepSpeed's Zero Redundancy Optimizer, and FlashAttention. With a focus on memory and runtime, we examine the impact of different optimization combinations on GPU memory usage and execution runtime during the fine-tuning phase. We recommend the best default optimization for balancing memory and runtime across diverse model sizes. We share effective strategies for fine-tuning very large models with tens or hundreds of billions of parameters and for enabling large context lengths during fine-tuning. Furthermore, we propose appropriate optimization mixtures for fine-tuning under GPU resource limitations.

Paper Structure (15 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Model state memory for a model with $\theta$ parameters, when fine-tuned using the Adam optimizer in a mixed-precision setting. Model state comprises the optimizer state, gradients, and model parameters. Total model state memory, with no optimization enabled, adds up to 16$\theta$ bytes.
  • Figure 2: GPU memory usage and fine-tuning runtime for different optimization configurations across ZeRO-1, ZeRO-2 and ZeRO-3 for Llama 2 7B. Using LoRA with ZeRO-2 provides the best balance between memory usage and runtime.
  • Figure 3: Impact of varying context length on GPU memory usage and fine-tuning time with and without FlashAttention-2 for Llama 2 70B. Enabling FlashAttention-2 on A100s significantly lowers the memory consumption and runtime for larger context lengths such as 4096.
  • Figure 4: Optimal configurations for fine-tuning LLMs of different sizes when using V100 GPUs. All Llama 2 experiments were run using 8xV100s, whereas Falcon 180B required 16xV100s. FlashAttention-2 is omitted as it is not supported on V100s.
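
The 16$\theta$ bytes quoted in the Figure 1 caption can be reproduced with a short calculation. The sketch below is illustrative and not taken from the paper: it assumes the usual mixed-precision Adam breakdown (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights, momentum, and variance) and the standard ZeRO partitioning scheme, in which stage 1 shards the optimizer state, stage 2 also shards the gradients, and stage 3 also shards the parameters across GPUs. The function name `model_state_bytes` and its arguments are hypothetical. It ignores activations, temporary buffers, and the effect of LoRA, so it covers only part of what a full memory estimate would need.

```python
def model_state_bytes(num_params: int, zero_stage: int = 0, num_gpus: int = 1) -> float:
    """Approximate per-GPU model state memory in bytes (Adam, mixed precision).

    Assumed breakdown (not from the paper's own memory model):
      fp16 parameters: 2 bytes per parameter
      fp16 gradients:  2 bytes per parameter
      optimizer state: 12 bytes per parameter
                       (fp32 master weights + momentum + variance)
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params

    # Each ZeRO stage shards one more component across the GPUs.
    if zero_stage >= 1:
        optim /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        params /= num_gpus

    return params + grads + optim


if __name__ == "__main__":
    theta = 7_000_000_000  # a 7B-parameter model, chosen for illustration
    for stage in (0, 1, 2, 3):
        gb = model_state_bytes(theta, zero_stage=stage, num_gpus=8) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB of model state per GPU")
```

With `zero_stage=0` the function returns exactly 16 bytes per parameter, which matches the caption. The higher stages show how much of that total each level of sharding removes from a single GPU.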