Training Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies
Astrid Brull, Sara Aguti, Véronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Joaquin Del-Rio, Oleksii Sliusarenko, Haiyan Zhou, Francesco Muntoni, Carsten G. Bönnemann, Xabi Uribe-Etxebarria
TL;DR
Diagnosis of collagen VI-related dystrophies (COL6-RD) is limited by sparse, privacy-sensitive data. The authors implement a privacy-preserving federated learning (FL) framework across NIH and UCL to train a collagen VI immunofluorescence image classifier without sharing raw data, achieving a mean F1 of 0.82 and accuracy of 0.83, outperforming single-site baselines. The model distinguishes control images and three COL6-RD pathogenic mechanisms (exon skipping, glycine substitution, pseudoexon insertion), aiding interpretation of uncertain variants and guiding sequencing priorities. This work demonstrates the feasibility of cross-institutional FL for rare-disease imaging and points to future expansion, standardization, and multimodal integration to further improve diagnostic utility.
Abstract
The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative, built on the Sherpa.ai FL platform, that trains across distributed datasets held by two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (F1-scores of 0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
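
To illustrate how collaborative training can proceed while data stay local, the following is a minimal sketch of one aggregation round of federated averaging (FedAvg), a widely used FL scheme. The abstract does not state which aggregation rule the platform applies, so FedAvg is an assumption here, and the function name `federated_average`, the parameter vectors, and the image counts are all hypothetical. Each site trains on its own images and sends only model parameters; the coordinator combines them, weighted by local dataset size.

```python
from typing import List, Sequence


def federated_average(
    site_weights: Sequence[Sequence[float]],
    site_sizes: Sequence[int],
) -> List[float]:
    """Combine per-site parameter vectors into one global vector.

    Each site's parameters are weighted by its number of local samples.
    Only parameters are exchanged; raw images never leave the sites.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(site_weights[0])
    if any(len(w) != dim for w in site_weights):
        raise ValueError("all sites must share the same model shape")
    # Sample-size-weighted mean, computed coordinate by coordinate.
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical example: two sites with toy two-parameter models.
site_a = [1.0, 2.0]  # parameters after local training at site A (30 images)
site_b = [3.0, 4.0]  # parameters after local training at site B (10 images)
global_weights = federated_average([site_a, site_b], [30, 10])
print(global_weights)
```

In a full training run this step would repeat over many rounds, with the global parameters sent back to each site as the starting point for the next round of local training.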
