You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models

Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan

TL;DR

D2FT tackles the practical challenge of fine-tuning foundation models on memory-constrained hardware by dynamically selecting which attention-subnet operations to execute across distributed devices. It formulates the orchestration as a multiple knapsack problem, solved with a two-stage heuristic and DP-based scheduling, and extends to LoRA-based parameter-efficient fine-tuning (PEFT). The method reduces training compute by up to $40\%$ and training communication by up to $50\%$ with only a $1\%$–$2\%$ accuracy drop on CIFAR-10, CIFAR-100, and Stanford Cars. Applied to LoRA at similar cost reductions, top-1 accuracy drops by $4\%$–$6\%$ on Stanford Cars. D2FT is also robust to device heterogeneity and offers a practical route to efficient, scalable fine-tuning of large transformer models in resource-limited environments.

Abstract

Fine-tuning plays a crucial role in adapting models to downstream tasks with minimal training effort. However, the rapidly increasing size of foundation models poses a daunting challenge for accommodating foundation model fine-tuning on most commercial devices, which often have limited memory bandwidth. Techniques like model sharding and tensor parallelism address this issue by distributing computation across multiple devices to meet memory requirements. Nevertheless, these methods do not fully leverage the foundation nature of the models in facilitating the fine-tuning process, resulting in high computational costs and imbalanced workloads. We introduce a novel Distributed Dynamic Fine-Tuning (D2FT) framework that strategically orchestrates operations across attention modules based on our observation that not all attention modules are necessary for forward and backward propagation in fine-tuning foundation models. Through three innovative selection strategies, D2FT significantly reduces the computational workload required for fine-tuning foundation models. Furthermore, D2FT addresses workload imbalances in distributed computing environments by optimizing these selection strategies via multiple knapsack optimization. Our experimental results demonstrate that the proposed D2FT framework reduces training computational costs by 40% and training communication costs by 50% with only a 1% to 2% accuracy drop on the CIFAR-10, CIFAR-100, and Stanford Cars datasets. Moreover, the results show that D2FT can be effectively extended to LoRA, a state-of-the-art parameter-efficient fine-tuning technique. When reducing computational cost by 40% or communication cost by 50%, the top-1 accuracy of D2FT with LoRA drops by only 4% to 6% on the Stanford Cars dataset.
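
To make the multiple knapsack framing more concrete, the sketch below shows a generic greedy heuristic for assigning operations to devices with limited compute budgets. It is not the paper's algorithm: the `Op` and `Device` classes, the `value` and `cost` fields, and the balancing rule are all hypothetical choices made for illustration, and the actual D2FT formulation (its selection strategies, two-stage heuristic, and scheduling) is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A candidate operation (hypothetical), e.g. running one attention module."""
    name: str
    value: float  # assumed importance score, higher is better
    cost: float   # assumed compute cost, must be positive


@dataclass
class Device:
    """A device acting as one knapsack with a fixed compute budget (hypothetical)."""
    name: str
    capacity: float
    assigned: list = field(default_factory=list)
    used: float = 0.0


def greedy_multi_knapsack(ops, devices):
    """Greedy heuristic, not an exact solver: visit ops by value per unit cost
    and place each on the device with the most remaining capacity.
    Returns the names of ops that did not fit anywhere (i.e. skipped)."""
    skipped = []
    for op in sorted(ops, key=lambda o: o.value / o.cost, reverse=True):
        # Picking the least-loaded device keeps workloads roughly balanced.
        target = max(devices, key=lambda d: d.capacity - d.used)
        if target.capacity - target.used >= op.cost:
            target.assigned.append(op.name)
            target.used += op.cost
        else:
            skipped.append(op.name)
    return skipped


# Toy example with made-up scores and costs.
ops = [
    Op("attn0", value=0.9, cost=2.0),
    Op("attn1", value=0.5, cost=2.0),
    Op("attn2", value=0.8, cost=1.0),
    Op("attn3", value=0.1, cost=2.0),
]
devices = [Device("dev0", capacity=3.0), Device("dev1", capacity=2.0)]
skipped = greedy_multi_knapsack(ops, devices)
for d in devices:
    print(d.name, d.assigned, d.used)
print("skipped:", skipped)
```

In this toy run the lowest-value operation is the one left out, which mirrors the abstract's observation that not every attention module needs to be executed. An exact or DP-based solver would replace the greedy loop when optimality matters.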

Paper Structure

This paper contains 27 sections, 7 equations, 3 figures, 10 tables, 2 algorithms.

Figures (3)

  • Figure 1: Full-parameter fine-tuning performance comparison. The proposed D2FT framework outperforms existing efficient distributed learning frameworks under similar computation and communication costs on the CIFAR-100 and Stanford Cars datasets. "Standard" denotes standard full fine-tuning.
  • Figure 2: Top-1 accuracy comparison under the same or similar computation and communication costs on the CIFAR-10 dataset. "Standard" denotes standard full fine-tuning.
  • Figure 3: Comparison of LoRA fine-tuning. We present the top-1 accuracy under the same or similar computation and communication costs on the Stanford Cars dataset. "Standard LoRA" shows standard LoRA fine-tuning performance with the same rank as D2FT. "LoRA w/ small rank" shows standard LoRA fine-tuning performance with a smaller rank chosen to match D2FT's computational costs.