Floe: Federated Specialization for Real-Time LLM-SLM Inference

Chunlin Tian, Kahou Tam, Yebo Wu, Shuaihang Zhong, Li Li, Nicholas D. Lane, Chengzhong Xu

TL;DR

Floe tackles the challenge of real-time, privacy-preserving LLM inference in edge environments by fusing a cloud-hosted black-box LLM with on-device lightweight SLMs through a federated, parameter-efficient fine-tuning workflow. It introduces heterogeneity-aware adaptive LoRA adapters, a task-specific clustering-and-aggregation mechanism, and a two-layer inference pipeline with a privacy detector, a logit-level alignment, and a parameter-free MoE router to coordinate edge and cloud reasoning. Theoretical convergence guarantees are provided for the clustered LoRA updates under standard FL assumptions, and extensive experiments demonstrate improved accuracy on multi-task benchmarks, substantial latency and energy savings, and strong privacy retention, across open-source and proprietary models. The results indicate Floe's practical potential for private, low-latency, personalized LLM deployment on diverse edge devices in real-world settings.

Abstract

Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.
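As a rough illustration of what logit-level fusion between an edge SLM and a cloud LLM can look like, the sketch below mixes the two models' next-token logits with a fixed weight and decodes greedily from the fused distribution. The function names, the mixing weight `alpha`, and the assumption that both models share one vocabulary are illustrative assumptions; the abstract does not specify Floe's actual alignment or routing rule.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(slm_logits, llm_logits, alpha=0.5):
    """Hypothetical fusion rule: convex combination of the two logit vectors.

    alpha weights the edge SLM; (1 - alpha) weights the cloud LLM.
    Assumes both vectors are indexed by the same vocabulary.
    """
    if len(slm_logits) != len(llm_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [alpha * s + (1.0 - alpha) * l for s, l in zip(slm_logits, llm_logits)]

def next_token(slm_logits, llm_logits, alpha=0.5):
    """Greedy decoding over the fused distribution; returns (token_id, probs)."""
    probs = softmax(fuse_logits(slm_logits, llm_logits, alpha))
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, probs

# Toy 4-token vocabulary: the SLM prefers token 1, the LLM prefers token 2.
slm = [0.1, 2.0, 0.5, -1.0]
llm = [0.0, 0.5, 3.0, -0.5]
token_id, probs = next_token(slm, llm, alpha=0.3)
```

With `alpha=0.3` the cloud model dominates the mix, while `alpha=1.0` reduces to edge-only decoding, so no logits would need to come from the cloud for that token.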

Paper Structure (36 sections, 1 theorem, 15 equations, 16 figures, 5 tables, 2 algorithms)


Key Result

Theorem 1 (Convergence). Under Assumptions 1-4, if $\eta \le \frac{1}{30LE}$, the average squared gradient norm after $T$ rounds is bounded by:

Figures (16)

  • Figure 1: Comparison of existing LM fine-tuning and inference approaches with Floe. Conventional: (1.a) The end user fine-tunes the LLM directly at the edge, which breaks model copyright and runs into memory constraints. (1.b) The end user fine-tunes a specialized SLM at the edge, which is personalized but underperforms. (2.a) Inference by the cloud LLM performs better but risks privacy. (2.b) Inference by a specialized edge SLM is privacy-friendly but underperforms. Floe: (3.a) Federated SLM fine-tuning across edge devices, with sufficient data and privacy. (3.b) Harmonizing LLMs and SLMs improves both privacy and performance.
  • Figure 2: LLM inference process. The logits play a crucial role in determining the final output.
  • Figure 3: Model performance on the in-domain dataset across various non-IID levels. (Model: Llama-7B; Dataset: CodeAlpaca; Benchmark: HumanEval)
  • Figure 4: Fine-tuning latency with different LoRA ranks on different edge devices. (Tiny-LLaMA on GSM-8K)
  • Figure 5: Task embedding heatmap.
  • ...and 11 more figures
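
Figure 4 relates fine-tuning latency to LoRA rank. One reason rank matters is that the number of trainable adapter parameters grows linearly with it: in the standard LoRA formulation, an adapter for a weight matrix of shape d_out × d_in adds two low-rank factors with r·(d_in + d_out) parameters in total. The sketch below computes that count for a few ranks. The layer dimensions and helper names are hypothetical and not taken from the paper, and parameter count is only a proxy for the measured latency.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters in the frozen full-rank weight matrix, for comparison."""
    return d_in * d_out

# Hypothetical 2048 x 2048 projection layer (dimensions chosen for illustration only).
d_in, d_out = 2048, 2048
counts = {r: lora_param_count(d_in, d_out, r) for r in (4, 8, 16, 32)}
for r, n in counts.items():
    share = n / full_param_count(d_in, d_out)
    print(f"rank={r:>2}  trainable={n:>7}  share of full matrix={share:.2%}")
```

Doubling the rank doubles the adapter size, which is why a device with less memory or compute might reasonably be assigned a smaller rank.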

Theorems & Definitions (1)

  • Theorem 1