A Study of Optimizations for Fine-tuning Large Language Models
Arjun Singh, Nikhil Pandey, Anup Shirgaonkar, Pavan Manoj, Vijay Aski
TL;DR
This work tackles the memory bottlenecks in fine-tuning very large language models by empirically evaluating a set of optimization techniques (Gradient Checkpointing, LoRA, DeepSpeed ZeRO, and FlashAttention) and their interactions. It demonstrates that ZeRO-2 combined with LoRA offers the best default memory-time balance for a range of model sizes, while ZeRO-3 with LoRA and gradient checkpointing is essential for the largest models, especially under resource limits. A theoretical memory model is developed to estimate GPU memory usage and guide resource planning, and long-context fine-tuning is shown to benefit significantly from FlashAttention-2 on compatible GPUs. The results provide practical guidance for enterprise fine-tuning, including recommended defaults, strategies for long-context data, and approaches for constrained hardware, with future work pointing to quantization, smaller model regimes, and much longer context lengths.
Abstract
Fine-tuning large language models is a popular choice among users trying to adapt them for specific applications. However, fine-tuning these models is a demanding task because the user has to weigh several factors, including resource budget, runtime, model size, and context length. A specific challenge is that fine-tuning is memory intensive, which constrains both the hardware memory required and the context length of training data that can be handled. In this work, we share a detailed study of a variety of fine-tuning optimizations across different fine-tuning scenarios. In particular, we assess Gradient Checkpointing, Low-Rank Adaptation, DeepSpeed's Zero Redundancy Optimizer, and FlashAttention. With a focus on memory and runtime, we examine the impact of different optimization combinations on GPU memory usage and execution runtime during the fine-tuning phase. We recommend the best default optimization for balancing memory and runtime across diverse model sizes. We share effective strategies for fine-tuning very large models with tens or hundreds of billions of parameters and for enabling large context lengths during fine-tuning. Furthermore, we propose appropriate optimization combinations for fine-tuning under GPU resource limitations.
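To give a rough sense of why fine-tuning is memory intensive, the sketch below estimates per-GPU memory for model states only, using the standard accounting for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states) and the way each ZeRO stage partitions those states across GPUs. It is not the memory model developed in this paper: it assumes full fine-tuning, so it does not capture the savings from LoRA, and it ignores activation memory, which is what Gradient Checkpointing and FlashAttention reduce. The function name and its parameters are hypothetical.

```python
def zero_model_state_gib(num_params, num_gpus, stage):
    """Rough per-GPU model-state memory (GiB) for mixed-precision Adam.

    Assumption: full fine-tuning with fp16 weights and gradients plus
    fp32 optimizer states. Activations, buffers, and fragmentation are
    not modeled, so real usage will be higher.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights = 2.0 * num_params     # fp16 weights
    grads = 2.0 * num_params       # fp16 gradients
    optimizer = 12.0 * num_params  # fp32 weights copy, momentum, variance

    if stage >= 1:                 # ZeRO-1 partitions optimizer states
        optimizer /= num_gpus
    if stage >= 2:                 # ZeRO-2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:                 # ZeRO-3 also partitions the weights
        weights /= num_gpus

    return (weights + grads + optimizer) / 2**30


if __name__ == "__main__":
    # Illustrative only: a 7-billion-parameter model on 8 GPUs.
    for stage in (0, 1, 2, 3):
        gib = zero_model_state_gib(7e9, 8, stage)
        print(f"ZeRO stage {stage}: {gib:.1f} GiB per GPU (model states only)")
```

Under these assumptions, the per-GPU footprint shrinks with each ZeRO stage as more of the model state is partitioned, which is consistent with reserving ZeRO-3 for models whose weights alone do not fit on a single GPU.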
