Technical and Legal Aspects of Federated Learning in Bioinformatics: Applications, Challenges and Opportunities
Daniele Malpetti, Marco Scutari, Francesco Gualdi, Jessica van Setten, Sander van der Laan, Saskia Haitjema, Aaron Mark Lee, Isabelle Hering, Francesca Mangili
TL;DR
Federated learning (FL) addresses privacy and data-sharing barriers in bioinformatics by enabling collaborative model training across institutions without sharing patient data: each site trains on its own data, and only model parameters are aggregated, typically weighted by each site's share of the total sample size $N$. The paper surveys foundational FL concepts, security and privacy techniques, and practical bioinformatics applications across proteomics, GWAS, scRNA-seq, multi-omics, and medical imaging, highlighting ready-to-use tools such as sfkit and FeatureCloud. It discusses legal and operational considerations under GDPR and AI governance, and provides practical guidance and case studies showing that FL can outperform isolated analyses while preserving data privacy. Looking forward, the authors identify challenges in scaling, heterogeneity, and verifiability, and point to future directions such as federated foundation models and FL-as-a-Service as essential for widespread, secure adoption in biomedical research.
Abstract
Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. This paper provides a gentle introduction to this approach in bioinformatics, and is the first to review key applications in proteomics, genome-wide association studies (GWAS), and single-cell and multi-omics studies together with their legal, methodological and infrastructural challenges. As the evolution of biobanks in genetics and systems biology has shown, access to larger and more varied data pools leads to faster and more robust exploration and translation of results. More widespread use of federated learning may have a similar impact in bioinformatics, allowing academic and clinical institutions to access many combinations of genotypic, phenotypic and environmental information that are underrepresented in, or absent from, existing biobanks.
