Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

Kilian Pfeiffer, Mohamed Aboelenien Ahmed, Ramin Khalili, Jörg Henkel

TL;DR

This work addresses training pretrained tiny Transformers in resource-constrained cross-device Federated Learning. It introduces a layer-finetuning approach coupled with an NN-architecture selection mechanism that chooses, for each device, how many layers to train from the suffix of a pretrained architecture, while freezing the rest, and aggregates updates per-layer to the global model. By explicitly modeling memory, upload, and compute footprints and optimizing over feasible configurations, the method achieves higher accuracy and fairness than LoRA and several FL baselines across language and vision tasks. The results demonstrate practical viability for deploying private, efficient FL systems on heterogeneous edge devices without sacrificing performance.
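The suffix-training and per-layer aggregation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `trained_suffix` and `aggregate_per_layer`, the flat-list layer representation, and the unweighted averaging over contributing devices are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: one layer = a flat list of parameters.
Layer = List[float]


def trained_suffix(num_layers: int, k: int) -> range:
    """Indices of the last k layers, i.e. the layers a device trains.

    All earlier layers stay frozen in their pretrained state.
    """
    k = max(0, min(k, num_layers))
    return range(num_layers - k, num_layers)


def aggregate_per_layer(
    global_model: List[Layer],
    updates: List[Dict[int, Layer]],
) -> List[Layer]:
    """Average each layer over the devices that actually trained it.

    Assumption: plain unweighted averaging. Layers that no device
    trained are copied unchanged from the global model.
    """
    new_model: List[Layer] = []
    for idx, old in enumerate(global_model):
        contributions = [u[idx] for u in updates if idx in u]
        if not contributions:
            new_model.append(list(old))
            continue
        n = len(contributions)
        new_model.append([sum(vals) / n for vals in zip(*contributions)])
    return new_model


# Toy example: a 3-layer model and two devices with different budgets.
global_model = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
device_a = {i: [10.0, 10.0] for i in trained_suffix(3, 1)}  # trains layer 2
device_b = {i: [20.0, 20.0] for i in trained_suffix(3, 2)}  # trains layers 1, 2
result = aggregate_per_layer(global_model, [device_a, device_b])
```

In this toy run, layer 0 keeps its pretrained values, layer 1 takes device B's values alone, and layer 2 is the mean of both devices' values. A device with a tighter budget therefore still contributes to the layers it can afford to train.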

Abstract

In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource requirements, particularly in terms of the large number of Floating Point Operations (FLOPs) and the high amounts of memory needed. To fine-tune such a model in a parameter-efficient way, techniques like Adapter or LoRA have been developed. However, we observe that the application of LoRA, when used in federated learning (FL), while still being parameter-efficient, is memory and FLOP inefficient. Based on that observation, we develop a novel layer finetuning scheme that allows devices in cross-device FL to make use of pretrained neural networks (NNs) while adhering to given resource constraints. We show that our presented scheme outperforms the current state of the art when dealing with homogeneous or heterogeneous computation and memory constraints and is on par with LoRA regarding limited communication, thereby achieving significantly higher accuracies in FL training.

Paper Structure

This paper contains 13 sections, 8 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of downstream performance and resource requirements for next-token prediction on Shakespeare (Caldas et al., 2018) using layer finetuning (with layers frozen from first to last, where each dot represents a specific number of frozen layers) and LoRA, with tiny Transformers pretrained on OpenWebText that have 3, 6, and 9 layers. We observe that while LoRA (with ranks 24, 12, 3) can achieve gains in communication efficiency, it requires significantly more peak memory and FLOPs to reach the same accuracy. Hyperparameters and details are provided in \ref{subsec:hyperparameters}. GPT2 is included to highlight inference costs (accuracy is based on GPT2's tokenizer, and the test data may be part of GPT2's training set).
  • Figure 2: We propose an FL technique for cross-device FL with heterogeneous devices that incorporates the availability of pretrained models. Unlike previous approaches, our technique retains some layers in their pretrained state.
  • Figure 3: Ablation study of NN selection (Eq. \ref{eq:min2}). We observe that selecting the NN that maximizes the average number of trained layers yields the highest accuracy (blue).
  • Figure 4: Visualization of homogeneous results for Shakespeare and CIFAR100. For our method, we apply NN selection based on Eqs. \ref{eq:feasible} and \ref{eq:min2}. For Heterog. LoRA and FedHM, we evaluate all NN architectures and, for a given constraint, present the best-performing one in Table \ref{tab:results}.
  • Figure 5: Memory components of layer finetuning and LoRA for language models (right) and ViT (left) with 3, 6, and 9 layers.