Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls

Saurav Prakash, Jin Sima, Chao Pan, Eli Chien, Olgica Milenkovic

TL;DR

The paper tackles learning on tree-like data by leveraging hyperbolic geometry to enable low-distortion embeddings, aiming to perform privacy-preserving federated classification across distributed biomedical datasets. It proposes a one-shot federated SVM in the Poincaré disc that communicates minimal convex hull information, resolves label switching with $B_h$ sequences, and uses Poincaré quantization with Reed-Solomon-like encoding for secure transmission, followed by balanced graph partitioning to aggregate hulls at the server. The key contributions include a hyperbolic Graham scan for convex hulls, ε-Poincaré quantization, convex hull complexity and privacy leakage analysis, $B_h$-based label encoding, secure SCMA transmission, and a graph-partitioning server aggregator enabling accurate global SVM learning. Experiments on synthetic and single-cell RNA-seq datasets demonstrate that the federated hyperbolic approach can outperform Euclidean federated methods and approach centralized performance, highlighting the practical impact for privacy-preserving learning on hierarchical biological data.

Abstract

Hierarchical and tree-like data sets arise in many applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion. This problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincaré discs. In this setting, the information conveyed from clients to the global classifier are convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit communication cost for the hulls, we propose a new quantization method for the Poincaré disc coupled with Reed-Solomon-like encoding. Fourth, at the server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning. We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim 11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces.
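
The integer $B_h$ sequences mentioned in the abstract are sets of integers in which every sum of $h$ elements (repetition allowed) is distinct, so a sum identifies the multiset of its summands. The sketch below only illustrates that defining property. The function names and the example set {1, 2, 4, 8} are illustrative assumptions, not the paper's label encoding or transmission protocol.

```python
from itertools import combinations_with_replacement


def is_bh_sequence(seq, h):
    """Return True if all sums of h elements of seq (repetition allowed) are distinct."""
    seen = set()
    for combo in combinations_with_replacement(sorted(seq), h):
        s = sum(combo)
        if s in seen:
            return False  # two different multisets share the same sum
        seen.add(s)
    return True


def decode_sum(seq, h, total):
    """Return the multiset of h elements of seq that sums to total, or None.

    For a B_h sequence this multiset is unique whenever it exists.
    Brute force, for illustration only.
    """
    for combo in combinations_with_replacement(sorted(seq), h):
        if sum(combo) == total:
            return combo
    return None


# Hypothetical label set (illustrative, not taken from the paper).
labels = [1, 2, 4, 8]
print(is_bh_sequence(labels, 2))     # all pairwise sums differ
print(is_bh_sequence([1, 2, 3], 2))  # fails because 1 + 3 == 2 + 2
print(decode_sum(labels, 2, 6))      # the only pair summing to 6
```

With $h=2$ this is the Sidon-set property: a received sum of two labels can be mapped back to the unordered pair that produced it.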

Paper Structure

This paper contains 30 sections, 9 theorems, 38 equations, 7 figures, 3 tables, 3 algorithms.

Key Result

Theorem 5.1

Given a set $\mathcal{D}$ of $N$ points in the Poincaré disc, the paper's hyperbolic Graham scan (Alg. PGS) returns $CH(\mathcal{D})$ in $O(N\log N)$ time, where $CH(\mathcal{D})$ denotes the set of extreme points of the minimal convex hull of $\mathcal{D}$.
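
The paper's Graham scan is not reproduced on this page. As a hedged alternative that reaches the same $O(N\log N)$ bound, the sketch below uses a different, standard route: the Poincaré-to-Klein map $k = 2p/(1+\lVert p\rVert^2)$ sends hyperbolic geodesics to straight chords, so the extreme points of the geodesic hull coincide with those of the Euclidean hull of the mapped points, which Andrew's monotone chain finds after one sort. The function names are assumptions, and the input is assumed to be distinct 2D points strictly inside the unit disc.

```python
def poincare_to_klein(p):
    """Map a point of the Poincaré disc to the Klein disc (geodesics become chords)."""
    x, y = p
    s = 1.0 + x * x + y * y
    return (2.0 * x / s, 2.0 * y / s)


def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hyperbolic_hull_indices(points):
    """Indices of the extreme points of the geodesic convex hull.

    Not the paper's algorithm: Andrew's monotone chain on Klein coordinates.
    Assumes distinct points with Euclidean norm below 1.
    """
    klein = [poincare_to_klein(p) for p in points]
    order = sorted(range(len(points)), key=lambda i: klein[i])
    if len(order) <= 2:
        return order

    def half_hull(indices):
        chain = []
        for i in indices:
            # Drop the last point while it does not make a strict left turn.
            while len(chain) >= 2 and _cross(klein[chain[-2]], klein[chain[-1]], klein[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(reversed(order))
    # The endpoints of each chain repeat in the other one.
    return lower[:-1] + upper[:-1]


# (0.44, 0.44) lies inside the Euclidean triangle of the first three points,
# but outside the geodesic triangle, because geodesics bend toward the origin.
pts = [(0.0, 0.0), (0.9, 0.0), (0.0, 0.9), (0.44, 0.44), (0.1, 0.1)]
print(sorted(hyperbolic_hull_indices(pts)))
```

The example shows why a Euclidean hull taken directly in Poincaré coordinates would be wrong: it would discard the point (0.44, 0.44), which is extreme in the hyperbolic sense.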

Figures (7)

  • Figure 1: Diagram of the hyperbolic federated classification framework in the Poincaré model of a hyperbolic space. For simplicity, only binary classifiers for two clients are considered. (a) The clients embed their hierarchical data sets into the Poincaré disc. (b) The clients compute the convex hulls of their data to convey the extreme points to the server. (c) To efficiently communicate the extreme points, the Poincaré disc is uniformly quantized (due to distance skewing on the disc, the regions do not appear to be of the same size). (d) As part of the secure transmission module, only the information about the corresponding quantization bins containing extreme points is transmitted via Reed-Solomon coding, along with the unique labels of clusters held by the clients, selected from integer $B_h$ sequences (in this case, $h=2$ since there are two classes). (e) The server securely resolves the label switching issue via $B_2$-decoding. (f) Upon label disambiguation, the server constructs a complete weighted graph in which the convex hulls represent the nodes while the edge weights equal $w(\cdot,\cdot)=1/d(\cdot,\cdot)$, where $d(\cdot,\cdot)$ denotes the average pairwise hyperbolic distance between points in the two hulls. The server then performs balanced graph partitioning to aggregate the convex hulls and arrive at "proxies" for the original, global clusters. (g) Once the global clusters are reconstructed, a reference point (i.e., "bias" of the hyperbolic classifier), $\boldsymbol{p}$, is computed as the midpoint of the shortest geodesic between the convex hulls, and subsequently used for learning the "normal" vector $\boldsymbol{w}$ of the hyperbolic classifier.
  • Figure 4: Synthetic data classes constructed as described in the Data generation subsection. The red point denotes the sampled reference point $\boldsymbol{p}$. The geodesic through the red point is the ground truth hyperplane that corresponds to the sampled normal vector. In (a), we set $N=20,000$, while in (b), we set $N=60,000$.
  • Figure 5: Classification accuracy results for the synthetic data sets. The shaded areas represent the $95\%$ confidence interval for $10$ independent trials. (a)-(d) Influence of the parameter $\lVert{\boldsymbol{p}}\rVert$ on the classification accuracy. (e)-(h) Influence of the margin parameter $\gamma$ on the classification accuracy. (i)-(l) Influence of the Poincaré quantization parameter $\epsilon$ on classification accuracy.
  • Figure 6: Analysis of the impact of the quantization parameter $\epsilon$ on the convex hull distortion and complexity. The blue and green points correspond to labels $0$ and $3$ in the UC-Stromal data set. The solid lines denote the quantized convex hulls, while the dotted lines denote the actual convex hulls without quantization.
  • Figure 7: Impact of the quantization parameter ${\epsilon}$ on the accuracy and convex hull complexity of the UC-Stromal data set for the multi-label classification setting (with labels $0$-$3$). Here, as before, $CH$ denotes the quantized convex hull of a client. Average as well as maximum complexities are calculated over all clients and across all local quantized convex hulls. The shaded areas represent the $95\%$ confidence interval from $10$ independent trials.
  • ...and 2 more figures
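
The edge weight in the Figure 1(f) caption, $w(\cdot,\cdot)=1/d(\cdot,\cdot)$ with $d$ the average pairwise hyperbolic distance between points of two hulls, can be sketched as follows. This is a minimal illustration assuming the standard Poincaré disc of curvature $-1$, where $d(x,y)=\operatorname{arcosh}\!\left(1+\frac{2\lVert x-y\rVert^2}{(1-\lVert x\rVert^2)(1-\lVert y\rVert^2)}\right)$. The function names and sample hulls are assumptions, and the graph partitioning step itself is not shown.

```python
import math


def poincare_distance(x, y):
    """Hyperbolic distance between two points of the Poincaré disc (curvature -1)."""
    diff2 = (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2
    denom = (1.0 - (x[0] ** 2 + x[1] ** 2)) * (1.0 - (y[0] ** 2 + y[1] ** 2))
    return math.acosh(1.0 + 2.0 * diff2 / denom)


def edge_weight(hull_a, hull_b):
    """w = 1 / (average pairwise hyperbolic distance between the two point sets).

    Assumes the average distance is nonzero, i.e. the hulls are not the same single point.
    """
    total = sum(poincare_distance(a, b) for a in hull_a for b in hull_b)
    avg = total / (len(hull_a) * len(hull_b))
    return 1.0 / avg


# Hypothetical extreme points of three client hulls (illustrative values).
hull_1 = [(0.10, 0.00), (0.20, 0.10)]
hull_2 = [(0.30, 0.00), (0.30, 0.10)]    # close to hull_1
hull_3 = [(-0.80, 0.00), (-0.80, 0.10)]  # far from hull_1

print(edge_weight(hull_1, hull_2) > edge_weight(hull_1, hull_3))
```

Nearby hulls receive larger weights, which is what lets a partitioning step group hulls that likely belong to the same global cluster.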

Theorems & Definitions (17)

  • Theorem 5.1
  • Theorem 5.2
  • Definition 5
  • Remark 5.3
  • Theorem A.1
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • proof
  • ...and 7 more