Training Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies

Astrid Brull, Sara Aguti, Véronique Bolduc, Ying Hu, Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Joaquin Del-Rio, Oleksii Sliusarenko, Haiyan Zhou, Francesco Muntoni, Carsten G. Bönnemann, Xabi Uribe-Etxebarria

TL;DR

COL6-RD diagnosis is limited by sparse, privacy-sensitive data. The authors implement a privacy-preserving federated learning framework across NIH and UCL to train a collagen VI immunofluorescence image classifier without sharing raw data, achieving a mean F1 of 0.82 and accuracy of 0.83, outperforming single-site baselines. The model distinguishes control images and three COL6-RD pathogenic mechanisms (exon skipping, glycine substitution, pseudoexon insertion), aiding interpretation of uncertain variants and guiding sequencing priorities. This work demonstrates the feasibility of cross-institutional FL for rare-disease imaging and points to future expansion, standardization, and multimodal integration to further improve diagnostic utility.

Abstract

The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the Sherpa.ai FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57–0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
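To make the idea of training across sites while keeping data local more concrete, the following is a minimal sketch of one aggregation step in the style of federated averaging (FedAvg). It is an illustration only: the abstract does not state which aggregation rule the Sherpa.ai FL platform applies, so the weighting by local sample count, the function name `federated_average`, and the toy parameter vectors and sample counts are all assumptions, not the authors' implementation or results.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine per-site parameter vectors into one global vector.

    Each site trains locally and sends only its parameters (never images).
    The aggregator averages them, weighting each site by its sample count.
    This is a generic FedAvg-style rule, assumed here for illustration.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client parameter vector")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all parameter vectors must have the same length")
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical toy values: two sites, two parameters each, made-up sample counts.
site_a = [0.0, 2.0]
site_b = [4.0, 6.0]
global_weights = federated_average([site_a, site_b], [100, 300])
print(global_weights)  # [3.0, 5.0]
```

In a real deployment this step would be repeated over many rounds, with each site starting its next local training pass from the aggregated parameters.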

Paper Structure

This paper contains 19 sections, 17 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of the Sherpa.ai FL setup showing individual institutional accuracies and the resulting global federated model performance.
  • Figure 2: Collagen VI immunofluorescence images from control and patient-derived dermal fibroblast cultures. Each image represents one of the four classes. Scale bar = 50 $\mu$m.
  • Figure 3: Traditional (Centralized) training (left) compared to Federated training implemented on the Sherpa.ai FL platform (right). In FL, only model updates (not raw images) are shared with a central aggregator, enhancing privacy and regulatory compliance.
  • Figure 4: Mean metrics using Bazaga et al. model with different datasets.
  • Figure 5: Mean metrics comparison between the Bazaga et al. model and our approach for 10-fold cross-validation (NIH local model).
  • ...and 5 more figures

Theorems & Definitions (2)

  • Remark 1: Existence of solutions
  • Remark 2: Convexity