Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data

Haoran Li; Xinyuan Zhao; Dadi Guo; Hanlin Gu; Ziqian Zeng; Yuxing Han; Yangqiu Song; Lixin Fan; Qiang Yang

TL;DR

The paper tackles privacy constraints in federated LLM knowledge transfer by enabling a server LLM to augment a client’s domain-specific SLMs without exposing private data. It introduces FDKT, which uses a differentially private generator to create DP-sanitized synthetic data conditioned on private demonstrations, followed by server-side clustering, filtering, and in-context augmentation to align the transferred knowledge with the client distribution. Empirical results across multiple domains show that FDKT yields consistent, substantial improvements over baselines, especially under tight privacy budgets ($\epsilon<10$), and extends naturally to one-to-many, multi-task settings. The approach offers a practical pathway for privacy-preserving, domain-specific knowledge transfer in real-world, sensitive applications.

Abstract

As large language models (LLMs) demonstrate unparalleled performance and generalization ability, they are widely used and integrated into various applications. In sensitive domains, as commonly described in federated learning scenarios, stringent data security and privacy regulations prohibit directly using external LLMs on private data. For local clients, using LLMs to improve domain-specific small language models (SLMs), which are constrained by limited computational resources and limited domain-specific data, has attracted considerable research attention. Observing that LLMs can empower domain-specific SLMs, existing methods predominantly transfer knowledge from LLMs to SLMs by leveraging public data or by using LLMs to generate more data. However, because of the discrepancies between LLM-generated data and clients' domain-specific data, these methods do not yield substantial improvements on domain-specific tasks. In this paper, we introduce a Federated Domain-specific Knowledge Transfer (FDKT) framework, which enables domain-specific knowledge transfer from LLMs to SLMs while preserving clients' data privacy. The core insight is to leverage LLMs to augment data based on domain-specific few-shot demonstrations, which are synthesized from private domain data using differential privacy. Such synthetic samples share a similar data distribution with clients' private data and allow the server LLM to generate domain-specific knowledge that improves clients' SLMs. Extensive experimental results demonstrate that the proposed FDKT framework consistently and greatly improves SLMs' task performance, by around 5% with a privacy budget of less than 10, compared to local training on private data.
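The data flow described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: every name here (`Sample`, `client_generate_synthetic`, `server_transfer`, `stub_llm`, `build_training_set`) is hypothetical, the "DP generator" is a template-emitting placeholder with no actual privacy guarantee, clustering is approximated by grouping on labels, and the server LLM is replaced by a stub. The only point it tries to show is that the server-side steps receive synthetic samples and never the private ones.

```python
from collections import defaultdict
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Sample:
    text: str
    label: str


def client_generate_synthetic(private_data, n, seed=0):
    """Placeholder for the client's DP generator (hypothetical).

    A real client would sample from a generator trained under differential
    privacy. This stub only reuses the label set and emits template text,
    so no private text is copied into the output.
    """
    rng = random.Random(seed)
    labels = sorted({s.label for s in private_data})
    out = []
    for i in range(n):
        label = rng.choice(labels)
        out.append(Sample(f"synthetic {label} sample {i}", label))
    return out


def stub_llm(demos):
    """Stand-in for the server LLM: one new sample per demonstration."""
    return [Sample(f"augmented from: {d.text}", d.label) for d in demos]


def server_transfer(synthetic, demos_per_cluster=2, llm=None):
    """Cluster, filter, then augment, using synthetic data only."""
    llm = llm or stub_llm
    clusters = defaultdict(list)
    for s in synthetic:  # toy clustering: group by label
        clusters[s.label].append(s)
    augmented = []
    for members in clusters.values():
        # Toy filter: drop exact duplicates and empty texts.
        kept = [s for s in dict.fromkeys(members) if s.text.strip()]
        demos = kept[:demos_per_cluster]  # few-shot demonstrations
        augmented.extend(llm(demos))
    return augmented


def build_training_set(private_data, augmented):
    """Client side: mix private data with the returned augmented data."""
    return list(private_data) + list(augmented)


private = [Sample("patient note A", "medical"), Sample("contract clause B", "legal")]
synthetic = client_generate_synthetic(private, n=6)  # leaves the client
augmented = server_transfer(synthetic)               # runs on the server
train = build_training_set(private, augmented)       # stays on the client
```

In a real deployment the placeholder generator would be replaced by a model trained with a DP mechanism, and the label grouping by clustering over text embeddings; both substitutions are outside what this sketch demonstrates.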

Paper Structure (32 sections, 3 equations, 2 figures, 9 tables)

Introduction
Preliminaries
Federated Learning on LLMs
Differential Privacy
DP-tuned LMs
Federated Domain-Specific Knowledge Transfer
Problem Formulation
Client-side Synthetic Data Generation
Server-side Knowledge Transfer
High-quality Data Filtering Mechanism
In-context Data Augmentation
Local SLM Fine-tuning
Extending FDKT to One-to-many Scenario with Multiple Clients
Experiments
Experimental Setups
...and 17 more sections

Figures (2)

Figure 1: Overview of FDKT's selective knowledge transfer pipeline. The left subfigure illustrates the FDKT workflow for enhancing an individual client's performance, while the right subfigure depicts how FDKT facilitates federated training across multiple clients for multi-task learning. The yellow region is under the server's control; the rest belongs to the client. Rose lines denote interactions that involve the private data $D$, whereas blue lines denote interactions that do not disclose $D$. In all interactions between FDKT's client and server, only synthetic data $D'$ is exchanged to facilitate knowledge transfer. In the right subfigure, the multi-task prefix processor adds task-dependent prefixes to each client's augmented data to train multi-task SLMs.
Figure 2: Evaluation of FDKT in the one-to-many scenario. In-domain Local FT denotes an SLM that is fine-tuned and evaluated within the same domain; Out-domain FDKT refers to an SLM fine-tuned on one domain's private data mixed with augmented data $D^a$ and tested on another domain.
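The multi-task prefix processor mentioned in the Figure 1 caption could look something like the sketch below. The bracketed prefix format, the task names, and the function names `add_task_prefix` and `merge_clients` are assumptions made for illustration; the caption only states that task-dependent prefixes are added to each client's augmented data.

```python
def add_task_prefix(task_name, samples):
    """Prepend a task-dependent prefix to every sample text."""
    prefix = f"[{task_name}] "  # assumed format, not taken from the paper
    return [prefix + text for text in samples]


def merge_clients(augmented_by_task):
    """Combine each client's augmented data into one multi-task set."""
    merged = []
    for task_name, samples in augmented_by_task.items():
        merged.extend(add_task_prefix(task_name, samples))
    return merged


merged = merge_clients({
    "medical_qa": ["What treats a mild fever?"],
    "legal_qa": ["What is a tort?"],
})
```

Keeping the task name inside the input text lets a single SLM be trained on the merged set while still being told which task each example belongs to.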

Theorems & Definitions (1)

Definition 2.1: Differential Privacy
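The statement of Definition 2.1 is not reproduced on this page. For orientation, the standard $(\epsilon, \delta)$ formulation of differential privacy is given below; it is presumably what the definition refers to, but that is an assumption. The neighboring datasets are written $D_1$ and $D_2$ here to avoid a clash with the synthetic data $D'$ used in the figure captions.

```latex
A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-differential
privacy if, for any two neighboring datasets $D_1$ and $D_2$ that differ in a
single record and for any set $S$ of possible outputs,
\[
  \Pr[\mathcal{M}(D_1) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D_2) \in S] + \delta .
\]
```

A smaller $\epsilon$ means the two output distributions are harder to tell apart, which is why the privacy budget $\epsilon < 10$ quoted in the TL;DR is the quantity to read against this inequality.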
