
A Hybrid Federated Kernel Regularized Least Squares Algorithm

Celeste Damiani, Yulia Rodina, Sergio Decherchi

TL;DR

The paper tackles privacy-preserving learning in a hybrid horizontal-vertical federated setting where clinical and omics data are distributed across hospitals and omics centers. It develops two kernel-based RRLS procedures under a hybrid federated Conjugate Gradient framework: a fast naive method and a secure iterative variant that preserves data privacy through aggregated updates and synchronized noise removal. The authors prove that the federated methods converge to the centralized RRLS solution and demonstrate competitive performance on several datasets, while also analyzing security via Nyström-like landmarks and EDM reconstruction risk. They also explore defense strategies, including randomized kernel widths, to mitigate potential leakage, and discuss integrating RRLS into deeper multi-omics pipelines for practical impact. Overall, the work advances privacy-aware, kernel-based learning in complex data-partition scenarios with practical implications for clinical-omics research.

Abstract

Federated learning is becoming an increasingly viable and accepted strategy for building machine learning models in critical privacy-preserving scenarios such as clinical settings. Often, the data involved is not limited to clinical data but also includes additional omics features (e.g. proteomics). Consequently, data is distributed not only across hospitals but also across omics centers, which are labs capable of generating such additional features from biosamples. This scenario leads to a hybrid setting where data is scattered both in terms of samples and features. In this hybrid setting, we present an efficient reformulation of the Kernel Regularized Least Squares algorithm, introduce two variants and validate them using well-established datasets. Lastly, we discuss security measures to defend against possible attacks.
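To make the hybrid setting concrete, the following is a minimal sketch under stated assumptions, not the authors' algorithm. It assumes a Gaussian kernel, a set of synthetic landmarks `W` known to every party, and the common landmark-based system $(K_{nW}^\top K_{nW} + \lambda K_{WW})\,c = K_{nW}^\top y$ solved with Conjugate Gradient. The names `partial_sq_dists`, `local_kernel`, `sigma`, and `lam` are illustrative. It uses two facts: squared distances add up across feature blocks (vertical split), and the normal-equation terms add up across hospitals (horizontal split). No masking or noise is applied, so this mirrors neither the secure variant nor its defenses.

```python
import numpy as np

def partial_sq_dists(X_block, W_block):
    """Squared Euclidean distances computed on one party's feature block."""
    return ((X_block[:, None, :] - W_block[None, :, :]) ** 2).sum(axis=2)

def conjugate_gradient(matvec, b, tol=1e-12, max_iter=200):
    """Plain CG for a symmetric positive definite operator given as matvec."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual for the zero initial guess
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
sigma, lam = 1.5, 0.1                 # kernel width and regularization (assumed values)
n_clin, n_omics, m = 2, 3, 8          # feature counts per block, number of landmarks
W = rng.normal(size=(m, n_clin + n_omics))   # synthetic landmarks (assumption)

# Horizontal split: two hospitals, each holding its own batch of patients.
hospitals = []
for n_h in (30, 40):
    X = rng.normal(size=(n_h, n_clin + n_omics))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n_h)
    hospitals.append((X, y))

def local_kernel(X):
    # Vertical split: the clinical block and the omics block each contribute
    # a partial squared-distance matrix; their sum is the full distance.
    d2 = (partial_sq_dists(X[:, :n_clin], W[:, :n_clin])
          + partial_sq_dists(X[:, n_clin:], W[:, n_clin:]))
    return np.exp(-d2 / (2 * sigma**2))

K_blocks = [local_kernel(X) for X, _ in hospitals]
K_WW = np.exp(-partial_sq_dists(W, W) / (2 * sigma**2))

# Server side: only per-hospital aggregates are summed.
b = sum(K.T @ y for K, (_, y) in zip(K_blocks, hospitals))

def matvec(v):
    return sum(K.T @ (K @ v) for K in K_blocks) + lam * (K_WW @ v)

c_fed = conjugate_gradient(matvec, b)

# Centralized reference: pool all samples and all features in one place.
X_all = np.vstack([X for X, _ in hospitals])
y_all = np.concatenate([y for _, y in hospitals])
K_all = np.exp(-partial_sq_dists(X_all, W) / (2 * sigma**2))
c_central = np.linalg.solve(K_all.T @ K_all + lam * K_WW, K_all.T @ y_all)

max_gap = float(np.max(np.abs(c_fed - c_central)))
```

In this toy setup `max_gap` should be close to machine precision, which illustrates in miniature why a federated solve can match a centralized one. The convergence guarantee for the paper's own methods is the one proved by the authors, not this sketch.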

Paper Structure (21 sections, 15 equations, 8 figures, 6 tables)


Figures (8)

  • Figure 1: Visual comparison between KRLS and RRLS.
  • Figure 2: In the hybrid federated setting, samples are distributed among several hospitals, while features for those samples are hosted by other entities, such as omics centers.
  • Figure 3: In this example we assume that we are working with patients distributed over two hospitals, who share an array of clinical features, and three omics centers. The first two provide genomics data (Omics Center 1 provides the genomics features for patients from Hospital 1, and Omics Center 2 those for Hospital 2), while Omics Center 3 provides proteomics features for all the hospitals. Some data may be missing, since not every feature will have been computed for every patient.
  • Figure 4: Sequence diagram for an instance of Algorithm \ref{A:main} with two hospitals and three omics centers. Both the secure initialization and the main loop cycle over the hospitals, where the clinical data of different batches of patients reside (horizontal federation: these iterations are represented with the $\bullet$ symbol). Then, for each hospital, both the initialization and the main loop cycle over the omics centers, where different features for the considered patients are stored, calling Algorithm \ref{A:featureFed} (vertical federation: these iterations are represented with the $\circ$ symbol). The initialization is secure because neither the kernel submatrices nor the labels are ever sent without first being modified by a random quantity known only to the server.
  • Figure 5: PCA visualizations of some of the training sets, showcasing the three different methods used to select the landmarks set $W$.
  • ...and 3 more figures