Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Herbert Woisetschläger; Alexander Isenko; Shiqiang Wang; Ruben Mayer; Hans-Arno Jacobsen

TL;DR

The paper studies how to enable federated fine-tuning of large language models at the network edge under privacy and resource constraints. It takes a hardware-centric approach: FLAN-T5 models from $80\mathrm{M}$ to $3\mathrm{B}$ parameters are fine-tuned with LoRA on NVIDIA Jetson AGX Orin edge devices and compared against an NVIDIA A100 data-center GPU. Two metrics quantify edge FL performance: energy efficiency, $\eta_e = \frac{\mathrm{TPS}}{W}$, and granularity, $G = \frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}$. The paper also compares four federated optimizers (FedAvg, FedAvgM, FedAdam, FedAdamW) and finds that FedAdamW improves convergence, while communication remains a major energy sink, especially at the edge. The study identifies memory bandwidth as a bottleneck on edge devices, shows that PEFT substantially improves scalability, and points to regulatory pressure for energy-aware FL, outlining concrete steps toward more practical foundation-model training at the edge.
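To make the two metrics concrete, the following is a minimal sketch of how $\eta_e$ and $G$ could be computed from per-round measurements. It assumes that TPS denotes tokens processed per second and that $W$ is the average power draw in watts. The `RoundProfile` type, its field names, and the sample numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RoundProfile:
    """Hypothetical per-round measurements for one FL client."""
    tokens_processed: int    # tokens consumed during local training
    compute_seconds: float   # T_comp: time spent on local computation
    comm_seconds: float      # T_comm: time spent exchanging model updates
    avg_power_watts: float   # W: average power draw while training


def energy_efficiency(p: RoundProfile) -> float:
    """eta_e = TPS / W, assuming TPS means tokens per second."""
    tokens_per_second = p.tokens_processed / p.compute_seconds
    return tokens_per_second / p.avg_power_watts


def granularity(p: RoundProfile) -> float:
    """G = T_comp / T_comm; G > 1 means computation dominates the round."""
    return p.compute_seconds / p.comm_seconds


# Illustrative numbers only, not results from the paper.
profile = RoundProfile(
    tokens_processed=120_000,
    compute_seconds=60.0,
    comm_seconds=20.0,
    avg_power_watts=50.0,
)
print(f"eta_e = {energy_efficiency(profile):.1f} tokens/s per watt")
print(f"G     = {granularity(profile):.1f}")
```

Under these assumptions, a larger $\eta_e$ means more training throughput per unit of power, and a small $G$ indicates that a round is dominated by communication rather than computation.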

Abstract

Large Language Models (LLMs) and foundation models are popular because they offer individuals and businesses new opportunities to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires a vast amount of data, which can be challenging to access due to legal or technical restrictions and may require private computing resources. Federated Learning (FL) is a solution designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, using FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization to that of a state-of-the-art data center GPU, and study the network utilization under realistic conditions. Our contribution is twofold: First, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads. Second, by comparing these systems with a data-center GPU, we demonstrate the potential for improvement and the next steps toward achieving greater computational efficiency at the edge.

Paper Structure (24 sections, 3 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Background
Performance Objectives in Data Center Environments
Performance Objectives on the Edge
Regulatory Requirements with Regard to Energy Efficiency
Methodology
Computational Efficiency
Energy Efficiency
Communication Efficiency
Model Performance
Experimental Setup
Results
Computational & Energy Efficiency
Model Performance
Communication Efficiency
...and 9 more sections

Figures (6)

Figure 1: Development of computational power and resource availability of DL accelerators from 2017 to 2023 for data centers and embedded systems. Key: RPi4 = Raspberry Pi 4, Nano = NVIDIA Jetson Nano, Orin Nano = NVIDIA Jetson Orin Nano, AGX Orin = NVIDIA Jetson AGX Orin 64 GB.
Figure 2: DL training step times across FLAN-T5 transformer models with varying minibatch sizes on the SAMSum dataset, running on the NVIDIA A100 and Jetson AGX Orin platforms. Detailed metrics are available in the appendix.
Figure 3: Our NVIDIA Jetson AGX Orin 64 GB testbed: 10 devices with a freely configurable network interconnect of up to 10 Gbit/s. Active external cooling is a must at the given energy density of $10 \cdot 60\,\mathrm{W}$ maximum power draw.
Figure 4: We study the model FLOP utilization (MFU) and the energy efficiency ($\eta_e$) of the FLAN-T5 transformer model family and find a strong correlation between the MFU and $\eta_e$, which is useful to evaluate root causes for poor training speeds in real-time.
Figure 5: We show the effectiveness of Federated AdamW by training the FLAN-T5 model family in a federated setup with 100 clients (10 clients per round). We report the validation loss (left) and the ROUGE-1 score (right) as performance indicators. Note: The loss spikes for FLAN-T5 Large originate from an increased sensitivity of LoRA adapters with a large parameter count to non-IID data (Babakniya et al., 2023).
...and 1 more figures
