One-Shot Collaborative Data Distillation

William Holland, Chandra Thapa, Sarah Ali Siddiqui, Wei Shao, Seyit Camtepe

TL;DR

This work tackles the challenge of distilling large datasets into compact, high-fidelity synthetic data for distributed learning where client data are heterogeneous. It introduces CollabDM, a collaborative data distillation method based on distribution matching that requires only a single round of client-server communication, enabling global distillation without training multiple models across rounds. Empirical results show CollabDM and its enhanced variant CollabDM-pae outperform state-of-the-art one-shot methods (e.g., DENSE) on skewed partitions and deliver robust performance across benchmarks and a 5G attack-detection application, including strong cross-architecture transfer. The approach offers practical benefits for privacy-preserving, communication-efficient data sharing and rapid deployment in edge and networked environments like 5G networks.

Abstract

Large machine-learning training datasets can be distilled into small collections of informative synthetic data samples. These synthetic sets support efficient model learning and reduce the communication cost of data sharing. Thus, high-fidelity distilled data can support the efficient deployment of machine learning applications in distributed network environments. A naive way to construct a synthetic set in a distributed environment is to allow each client to perform local data distillation and to merge local distillations at a central server. However, the quality of the resulting set is impaired by heterogeneity in the distributions of the local data held by clients. To overcome this challenge, we introduce the first collaborative data distillation technique, called CollabDM, which captures the global distribution of the data and requires only a single round of communication between client and server. Our method outperforms the state-of-the-art one-shot learning method on skewed data in distributed learning environments. We also show the promising practical benefits of our method when applied to attack detection in 5G networks.
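The single-round idea can be illustrated with a rough sketch, loosely following the Figure 1 caption: the server shares seeds, each client embeds its data on models initialised from those seeds, and the server pools the results into a global target. Everything concrete below is an assumption made for illustration and is not taken from the paper: the one-layer ReLU embedder, the choice to send per-class embedding sums and counts, and the function names `make_embedder`, `client_message`, `server_targets`, and `matching_loss`. The sketch also omits the locally distilled data that clients send and the server-side optimisation that refines it.

```python
import numpy as np

def make_embedder(seed, in_dim, emb_dim):
    """Hypothetical stand-in model: a seeded random one-layer ReLU embedder."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, emb_dim))
    return lambda x: np.maximum(x @ weights, 0.0)

def client_message(x, y, seeds, num_classes, emb_dim):
    """Client side: per-seed, per-class embedding sums plus class counts."""
    counts = np.bincount(y, minlength=num_classes)
    sums = np.zeros((len(seeds), num_classes, emb_dim))
    for s, seed in enumerate(seeds):
        emb = make_embedder(seed, x.shape[1], emb_dim)(x)
        for c in range(num_classes):
            # An absent class contributes a zero vector and a zero count.
            sums[s, c] = emb[y == c].sum(axis=0)
    return sums, counts

def server_targets(messages):
    """Server side: pool client sums into global per-class mean embeddings."""
    total_sums = sum(m[0] for m in messages)
    total_counts = sum(m[1] for m in messages)
    return total_sums / np.maximum(total_counts, 1)[None, :, None]

def matching_loss(x_syn, y_syn, seeds, targets):
    """Squared distance between synthetic and global class-mean embeddings."""
    loss = 0.0
    for s, seed in enumerate(seeds):
        emb = make_embedder(seed, x_syn.shape[1], targets.shape[2])(x_syn)
        for c in np.unique(y_syn):
            diff = emb[y_syn == c].mean(axis=0) - targets[s, c]
            loss += float(np.sum(diff ** 2))
    return loss
```

Because sums and counts are pooled rather than local means being averaged, the server's targets in this sketch equal the class-mean embeddings of the combined data, however skewed the split across clients is. A server could then adjust a synthetic set to reduce `matching_loss` against those targets; that optimisation step is left out here.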

Paper Structure

This paper contains 18 sections, 12 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of CollabDM. In a single round of communication, the server sends seeds to initialize learning models. The client then distills local data and computes embeddings on the seeded models. Locally distilled data and computed embeddings are then sent to the server. The server uses the embeddings to refine the distilled data to reflect the global data distribution.
  • Figure 2: Testing accuracy vs. data transmitted per client across different parameter settings. The dashed red line corresponds to the classification accuracy of partition-and-expand distribution matching in the central model.
  • Figure 3: The impact of images-per-class on testing accuracy for 5G network traffic data.

Theorems & Definitions (1)

  • Definition 1: Data Distillation (sachdeva2023data)