Floe: Federated Specialization for Real-Time LLM-SLM Inference
Chunlin Tian, Kahou Tam, Yebo Wu, Shuaihang Zhong, Li Li, Nicholas D. Lane, Chengzhong Xu
TL;DR
Floe targets real-time, privacy-preserving LLM inference at the edge by pairing a cloud-hosted black-box LLM with lightweight on-device SLMs through a federated, parameter-efficient fine-tuning workflow. It introduces heterogeneity-aware adaptive LoRA adapters, a task-specific clustering-and-aggregation mechanism, and a two-layer inference pipeline that coordinates edge and cloud reasoning through a privacy detector, logit-level alignment, and a parameter-free MoE router. The authors provide theoretical convergence guarantees for the clustered LoRA updates under standard FL assumptions, and experiments across open-source and proprietary models show improved accuracy on multi-task benchmarks, substantial latency and energy savings, and strong privacy retention. The results indicate Floe's practical potential for private, low-latency, personalized LLM deployment on diverse edge devices in real-world settings.
Abstract
Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.
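To make the idea of logit-level fusion with a privacy gate more concrete, the following is a minimal illustrative sketch and not the paper's actual algorithm. It assumes that the edge SLM and the cloud LLM emit next-token logits over a shared, already aligned vocabulary, that a privacy detector has produced a boolean flag, and that the parameter-free routing weight is derived from each model's predictive entropy. The function names (`fuse_logits`, `log_softmax`, `entropy`) and the entropy-based weighting rule are hypothetical choices made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def fuse_logits(edge_logits, cloud_logits=None, is_private=False):
    """Return a fused next-token distribution (probabilities summing to 1).

    Assumption: both logit vectors index the same vocabulary.
    Assumption: the weighting rule (exp of negative entropy) is a stand-in
    for a parameter-free router; it is not taken from the paper.
    """
    edge_lp = log_softmax(edge_logits)

    # Privacy gate: a flagged query never uses the cloud model's output.
    if is_private or cloud_logits is None:
        return [math.exp(lp) for lp in edge_lp]

    if len(cloud_logits) != len(edge_logits):
        raise ValueError("edge and cloud logits must share one vocabulary")

    cloud_lp = log_softmax(cloud_logits)

    # Parameter-free weights: the more confident (lower-entropy) model
    # receives the larger share. No trainable parameters are involved.
    w_edge = math.exp(-entropy(edge_lp))
    w_cloud = math.exp(-entropy(cloud_lp))
    total = w_edge + w_cloud
    w_edge, w_cloud = w_edge / total, w_cloud / total

    # Fuse in log space, then renormalize to a proper distribution.
    fused = [w_edge * e + w_cloud * c for e, c in zip(edge_lp, cloud_lp)]
    return [math.exp(lp) for lp in log_softmax(fused)]
```

In this sketch, a query flagged as private is answered from the edge distribution alone, while a non-private query blends both models, with the more confident one dominating. A real system would also need tokenizer alignment between the two models, which is assumed away here.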
