Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly
Herbert Woisetschläger, Alexander Isenko, Shiqiang Wang, Ruben Mayer, Hans-Arno Jacobsen
TL;DR
The paper addresses how to enable federated fine-tuning of large language models at the network edge under privacy and resource constraints. It adopts a hardware-centric methodology: FLAN-T5 models from $80\mathrm{M}$ to $3\mathrm{B}$ parameters are fine-tuned with LoRA, a parameter-efficient fine-tuning (PEFT) method, on Jetson AGX Orin edge devices and compared against an NVIDIA A100 data-center GPU. To quantify edge FL performance, it introduces two metrics: energy efficiency, $\eta_e = \frac{\mathrm{TPS}}{W}$, and granularity, $G = \frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}$, the ratio of computation time to communication time. It also compares four optimizers (FedAvg, FedAvgM, FedAdam, FedAdamW) and finds that FedAdamW improves convergence, while communication remains a major energy sink, especially at the edge. The study identifies memory bandwidth as a bottleneck on edge devices, shows that PEFT substantially improves scalability, and points to a regulatory imperative for energy-aware FL. It closes by outlining concrete steps toward more practical foundation-model training at the edge.
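The two metrics above reduce to simple ratios, and a minimal Python sketch can make them concrete. It assumes that TPS denotes tokens per second and $W$ the power draw in watts, a reading this summary does not spell out. The function names and the measurement values are hypothetical, chosen only for illustration, and are not results from the paper.

```python
def energy_efficiency(tokens_per_second: float, power_watts: float) -> float:
    """eta_e = TPS / W (assumed: tokens per second divided by power in watts)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def granularity(t_comp_seconds: float, t_comm_seconds: float) -> float:
    """G = T_comp / T_comm: computation time relative to communication time."""
    if t_comm_seconds <= 0:
        raise ValueError("t_comm_seconds must be positive")
    return t_comp_seconds / t_comm_seconds


# Hypothetical measurements for one client and one FL round (illustration only).
eta_e = energy_efficiency(tokens_per_second=1200.0, power_watts=40.0)
g = granularity(t_comp_seconds=90.0, t_comm_seconds=30.0)
print(f"eta_e = {eta_e:.1f} tokens/s/W, G = {g:.1f}")
```

Under this reading, $G > 1$ means a client spends more time computing than communicating, and $G < 1$ means communication dominates the round.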
Abstract
Large Language Models (LLMs) and foundation models are popular as they offer new opportunities for individuals and businesses to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires a vast amount of data, which can be challenging to access due to legal or technical restrictions and may require private computing resources. Federated Learning (FL) is a solution designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, using FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization to that of a state-of-the-art data center GPU, and study the network utilization in realistic conditions. Our contribution is twofold: First, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads. Second, by comparing these systems with a data-center GPU, we demonstrate the potential for improvement and the next steps toward achieving greater computational efficiency at the edge.
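The abstract refers to model FLOP utilization without defining it. As commonly used, it is the ratio of the FLOP throughput a workload actually achieves to the hardware's theoretical peak. The sketch below assumes that definition. The helper name and both throughput figures are hypothetical placeholders, not measurements or device specifications from the paper.

```python
def model_flop_utilization(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Fraction of the theoretical peak FLOP throughput that a workload achieves."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical figures (illustration only): 50 TFLOP/s achieved on a
# device whose theoretical peak is 200 TFLOP/s.
mfu = model_flop_utilization(achieved_flops_per_s=50e12, peak_flops_per_s=200e12)
print(f"MFU = {mfu:.0%}")
```

Because the ratio is normalized by each device's own peak, it allows an edge device and a data-center GPU to be compared on how well they use the compute they have, rather than on raw throughput.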
