Federated t-SNE and UMAP for Distributed Data Visualization

Dong Qiao, Xinxian Ma, Jicong Fan

TL;DR

This paper tackles the challenge of visualizing high-dimensional data distributed across multiple sites without compromising privacy. It introduces Federated Distribution Learning (FedDL) to learn a global landmark set that captures the data distribution via Maximum Mean Discrepancy, enabling Nyström-based reconstruction of the full distance matrix required by t-SNE and UMAP. The authors present Fed-tSNE and Fed-UMAP (and the privacy-enhanced Fed-tSNE+ and Fed-UMAP+) and extend the framework to federated spectral clustering, with theoretical convergence and differential privacy guarantees. Empirical results on MNIST and Fashion-MNIST show that the federated visualizations and clustering achieve accuracy close to that of centralized methods, supporting the practicality of privacy-preserving distributed data visualization and clustering.
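The Nyström step mentioned above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel with a hypothetical bandwidth `sigma`, approximates the full kernel matrix from a landmark set `Y`, and converts kernel values back to squared distances via $d^2 = -2\sigma^2 \log k$. All function names are illustrative, and the federated protocol itself (what clients and server exchange) is not modeled here.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_kernel(X, Y, sigma):
    """Approximate K(X, X) by K(X, Y) K(Y, Y)^+ K(Y, X) using landmarks Y."""
    K_xy = gaussian_kernel(X, Y, sigma)
    K_yy = gaussian_kernel(Y, Y, sigma)
    return K_xy @ np.linalg.pinv(K_yy) @ K_xy.T

def kernel_to_sq_dist(K, sigma, eps=1e-12):
    """Invert the Gaussian kernel: d^2 = -2 sigma^2 log k (assumed conversion)."""
    K = np.clip(K, eps, 1.0)          # keep the logarithm finite and d^2 >= 0
    D = -2.0 * sigma**2 * np.log(K)
    np.fill_diagonal(D, 0.0)          # self-distances are zero by definition
    return D
```

A matrix produced this way could then be passed to a t-SNE or UMAP implementation that accepts precomputed distances. The quality of the approximation depends on how well the landmarks cover the data distribution, which is the role the summary assigns to FedDL.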

Abstract

High-dimensional data visualization is crucial in the big data era, and techniques such as t-SNE and UMAP have been widely used in science and engineering. Big data, however, is often distributed across multiple data centers and subject to security and privacy concerns, which creates difficulties for the standard t-SNE and UMAP algorithms. To tackle this challenge, this work proposes Fed-tSNE and Fed-UMAP, which provide high-dimensional data visualization under the framework of federated learning, without exchanging data across clients or sending data to the central server. The main idea of Fed-tSNE and Fed-UMAP is to implicitly learn the distribution information of the data in a federated manner and then estimate the global distance matrix for t-SNE and UMAP. To further enhance the protection of data privacy, we propose Fed-tSNE+ and Fed-UMAP+. We also extend our idea to federated spectral clustering, yielding algorithms for clustering distributed data. In addition to these new algorithms, we offer theoretical guarantees on optimization convergence, distance and similarity estimation, and differential privacy. Experiments on multiple datasets demonstrate that, compared to the original algorithms, the accuracy drops of our federated algorithms are tiny.
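To make the idea of learning distribution information concrete, the sketch below computes a squared Maximum Mean Discrepancy (MMD) between a client's data and a shared landmark set, and combines the per-client values with sample-size weights. This is an illustration under assumptions rather than the paper's objective: the Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the names `mmd2` and `weighted_objective` are all hypothetical choices made here.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2(X, Y, sigma):
    """Biased (V-statistic) estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

def weighted_objective(clients, Y, sigma):
    """Sample-size-weighted sum of per-client MMD^2 values (assumed form)."""
    n_total = sum(len(X_p) for X_p in clients)
    return sum(len(X_p) / n_total * mmd2(X_p, Y, sigma) for X_p in clients)
```

In a federated setting, each client would evaluate its own term (and its gradient with respect to `Y`) locally, so only landmark-related quantities would need to leave the client. The exact exchange used by the authors is not described in this summary.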

Paper Structure

This paper contains 30 sections, 7 theorems, 69 equations, 7 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

Assume the gradient of all local objective functions $\{f_p\}_{p=1}^P$ are $L_p$-Lipschitz continuous, $L = \sum_{p=1}^P\omega_p L_p$ with $\omega_p = \frac{n_p}{n_x}$, $\rho_L = \frac{\sum_{p=1}^P\omega_p L_p^2}{L^2}$, and $\Vert\nabla f_p - \nabla f_{p'}\Vert_F \le \zeta$ for all $p,p'$, the seque
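The constants named in the theorem statement can be computed directly from per-client values. The sketch below is only a worked illustration: it assumes $n_x = \sum_p n_p$ is the total sample count, and the helper name `lipschitz_summary` and the example inputs are hypothetical. Since the weights $\omega_p$ sum to one, Jensen's inequality gives $\rho_L \ge 1$, with equality when all $L_p$ are equal.

```python
import numpy as np

def lipschitz_summary(L_p, n_p):
    """Return (omega, L, rho_L) from per-client Lipschitz constants and sample counts."""
    L_p = np.asarray(L_p, dtype=float)
    n_p = np.asarray(n_p, dtype=float)
    omega = n_p / n_p.sum()                  # omega_p = n_p / n_x (n_x assumed to be the total)
    L = float(omega @ L_p)                   # L = sum_p omega_p * L_p
    rho_L = float(omega @ L_p**2) / L**2     # rho_L = sum_p omega_p * L_p^2 / L^2
    return omega, L, rho_L
```

For example, with constants `[1, 2, 3]` and sample counts `[10, 30, 60]`, the weights are `[0.1, 0.3, 0.6]`, so $L = 2.5$ and $\rho_L = 6.7 / 6.25 = 1.072$.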

Figures (7)

  • Figure 1: MNIST Data Visualization. Row 1: t-SNE, Fed-tSNE, and Fed-tSNE+. Row 2: UMAP, Fed-UMAP, and Fed-UMAP+.
  • Figure 2: Convergence Performance on MNIST
  • Figure 3: Visualization of Fed-tSNE and Fed-UMAP Convergence from epoch $1$ to $10$ (MNIST)
  • Figure 4: Visualization of Fashion-MNIST
  • Figure 5: Visualization of Fed-tSNE and Fed-UMAP Convergence (MNIST)
  • ...and 2 more figures

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2: Error bounds of Nyström approximation
  • Theorem 3: Error bound of Nyström approximation with FedDL having data perturbation
  • Theorem 4: Differential privacy of FedDL with data perturbation
  • Theorem 5: Error bound of Nyström approximation with FedDL having gradient perturbation
  • Theorem 6: Differential privacy of FedDL with gradient perturbation
  • Lemma 1
  • proof
  • proof: Proof of Theorem \ref{thm:reconstruction error 1}
  • proof
  • ...and 3 more