
Technical and Legal Aspects of Federated Learning in Bioinformatics: Applications, Challenges and Opportunities

Daniele Malpetti, Marco Scutari, Francesco Gualdi, Jessica van Setten, Sander van der Laan, Saskia Haitjema, Aaron Mark Lee, Isabelle Hering, Francesca Mangili

TL;DR

Federated learning (FL) addresses privacy and data-sharing barriers in bioinformatics by enabling collaborative model training across institutions without sharing patient data: each site trains locally, and only model parameters are aggregated, for example weighted by each site's share of the total sample size $N$. The paper surveys foundational FL concepts, security and privacy techniques, and practical bioinformatics applications across proteomics, GWAS, scRNA-seq, multi-omics, and medical imaging, highlighting ready-to-use tools such as sfkit and FeatureCloud. It discusses legal and operational considerations under GDPR and AI governance, and provides practical guidance and case studies showing that FL can outperform isolated analyses while preserving data privacy. Looking forward, the authors identify challenges in scaling, heterogeneity, and verifiability, and point to federated foundation models and FL-as-a-Service as directions they consider essential for widespread, secure adoption in biomedical research.

Abstract

Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. This paper provides a gentle introduction to this approach in bioinformatics, and is the first to review key applications in proteomics, genome-wide association studies (GWAS), single-cell and multi-omics studies together with their legal, methodological and infrastructural challenges. As the evolution of biobanks in genetics and systems biology has shown, access to larger and more varied data pools leads to faster and more robust exploration and translation of results. More widespread use of federated learning may have a similar impact in bioinformatics, allowing academic and clinical institutions to access many combinations of genotypic, phenotypic and environmental information that are under-represented in, or absent from, existing biobanks.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of a typical federated learning (FL) workflow. (1) The central server initialises a global model. (2) The server shares the global model parameters with consortium parties, referred to as clients. (3) Each client initialises a local model from the global model parameters and updates it by training it on its local data. (4) Clients send their updated local model parameters back to the server. (5) The server aggregates the local model parameters it collected to construct a new global model. (6) The server redistributes the updated global model parameters to clients to start the next training round. Steps (3)–(6) are repeated iteratively until a predefined stopping criterion is met. Active parties in each step are in green, and the arrows show the direction of information flow within the consortium.
  • Figure 2: Different FL topologies. In centralised topologies, the data holders are typically referred to as clients, reflecting their interaction with a central server. In decentralised topologies, where no central entity exists, the participants are often called parties.
  • Figure 3: Horizontal and vertical data partitioning in FL. In horizontal FL (left), clients hold data sets with the same features (c1–c3) but different subsets of samples (r1–r8). In vertical FL (right), clients hold data sets with different features (c1–c6) but the same set of samples (r1–r4).
  • Figure 4: Example of privacy-preserving sum computation in FL using three different techniques. Note that although differential privacy is described in the paper's privacy section, it is not included in this example, as it would not be suitable for such a calculation.
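
The workflow in the Figure 1 caption can be illustrated with a minimal sketch. It is not the paper's implementation: the one-feature linear model, the sample-size-weighted averaging rule, the synthetic data, and every function name (`local_update`, `aggregate`, `federated_training`, `make_client`) are assumptions chosen to keep the example self-contained.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Step 3: start from the global parameters and train on local data only."""
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y          # residual of the linear model
            grad_w += 2 * err * x
            grad_b += 2 * err
        w -= lr * grad_w / len(data)       # gradient step on mean squared error
        b -= lr * grad_b / len(data)
    return (w, b)

def aggregate(updates, sizes):
    """Step 5: average local parameters, weighted by local sample size (assumed rule)."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def federated_training(client_data, rounds=100):
    global_weights = (0.0, 0.0)            # step 1: server initialises the model
    sizes = [len(d) for d in client_data]
    for _ in range(rounds):
        # Steps 2-4: each client receives the global parameters, trains
        # locally, and returns parameters (never raw data) to the server.
        updates = [local_update(global_weights, d) for d in client_data]
        # Step 5: server builds the new global model; step 6 is the next loop pass.
        global_weights = aggregate(updates, sizes)
    return global_weights

def make_client(n, seed):
    """Hypothetical client holding n noiseless samples of y = 2x + 1."""
    rng = random.Random(seed)
    return [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(n))]

clients = [make_client(20, 0), make_client(50, 1), make_client(30, 2)]
w, b = federated_training(clients)
print(f"global model: y = {w:.3f} * x + {b:.3f}")
```

Here the stopping criterion is simply a fixed number of rounds. In this toy setting all clients share the same underlying relationship, so the aggregated model approaches it; with heterogeneous client data the behaviour can differ.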