Low dimensional representation of multi-patient flow cytometry datasets using optimal transport for minimal residual disease detection in leukemia

Erell Gachon, Jérémie Bigot, Elsa Cazelles, Audrey Bidet, Jean-Philippe Vial, Pierre-Yves Dumas, Aguirre Mimoun

TL;DR

This work tackles MRD detection in AML by recasting multi-patient flow cytometry datasets as high-dimensional probability measures and applying optimal-transport-based dimensionality reduction. By combining mean-measure quantization (via K-means on the merged data) with embeddings in linear spaces through either Wasserstein PCA or log-ratio PCA, the authors obtain informative 2D representations that reveal intra- and inter-patient variability and correlate with MRD measures. Across two datasets (DATAML-Bordeaux and HIPC), the OT-based approach outperforms kernel mean embeddings and aligns well with FlowSOM MRD results, enabling effective clustering and supervised classification of MRD status. This OT-driven framework offers a scalable, interpretable tool for MRD assessment and could enhance relapse prognosis when used alongside FlowSOM in clinical workflows.

Abstract

Representing and quantifying Minimal Residual Disease (MRD) in Acute Myeloid Leukemia (AML), a type of cancer that affects the blood and bone marrow, is essential in the prognosis and follow-up of AML patients. As traditional cytological analysis cannot detect leukemia cells below 5%, the analysis of flow cytometry datasets is expected to provide more reliable results. In this paper, we explore statistical learning methods based on optimal transport (OT) to achieve a relevant low-dimensional representation of multi-patient flow cytometry measurements (FCM) datasets considered as high-dimensional probability distributions. Using the framework of OT, we justify the use of the K-means algorithm for dimensionality reduction of multiple large-scale point clouds through mean measure quantization by merging all the data into a single point cloud. After this quantization step, the visualization of the intra- and inter-patient FCM variability is carried out by embedding low-dimensional quantized probability measures into a linear space using either Wasserstein Principal Component Analysis (PCA) through linearized OT or log-ratio PCA of compositional data. Using a publicly available FCM dataset and an FCM dataset from Bordeaux University Hospital, we demonstrate the benefits of our approach over the popular kernel mean embedding technique for statistical learning from multiple high-dimensional probability distributions. We also highlight the usefulness of our methodology for low-dimensional projection and clustering of patient measurements according to their level of MRD in AML from FCM. In particular, our OT-based approach allows a relevant and informative two-dimensional representation of the results of the FlowSOM algorithm, a state-of-the-art method for the detection of MRD in AML using multi-patient FCM.
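The quantization and log-ratio PCA steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the synthetic two-population data, the plain Lloyd iteration, the pseudo-count used to avoid taking the logarithm of an empty cluster, the centered log-ratio transform as the specific log-ratio, and all function names (`kmeans`, `cluster_weights`, `clr`, `pca_2d`) are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on one point cloud; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances between every point and every centroid: (n_points, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

def cluster_weights(labels, sample_ids, n_samples, k, pseudo_count=0.5):
    """Per-sample cluster proportions; pseudo_count is an assumed zero fix."""
    counts = np.zeros((n_samples, k))
    np.add.at(counts, (sample_ids, labels), 1.0)
    counts += pseudo_count
    return counts / counts.sum(axis=1, keepdims=True)

def clr(weights):
    """Centered log-ratio transform of compositional rows."""
    log_w = np.log(weights)
    return log_w - log_w.mean(axis=1, keepdims=True)

def pca_2d(features):
    """Scores on the first two principal components, via SVD."""
    centered = features - features.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

# Synthetic stand-in for FCM samples: each sample mixes two Gaussian
# populations, with a sample-dependent share of the second one.
rng = np.random.default_rng(0)
n_samples, n_cells, dim, k = 6, 500, 4, 5
clouds = []
for i in range(n_samples):
    n_b = int((0.05 + 0.15 * i) * n_cells)
    pop_a = rng.normal(0.0, 1.0, size=(n_cells - n_b, dim))
    pop_b = rng.normal(4.0, 1.0, size=(n_b, dim))
    clouds.append(np.vstack([pop_a, pop_b]))

merged = np.vstack(clouds)                      # merge all samples into one cloud
sample_ids = np.repeat(np.arange(n_samples), n_cells)
centroids, labels = kmeans(merged, k)           # shared centroids for all samples
weights = cluster_weights(labels, sample_ids, n_samples, k)
embedding = pca_2d(clr(weights))                # one 2D point per sample
print(embedding.shape)
```

Each sample is reduced to a vector of K cluster proportions on shared centroids, so the PCA operates on an N-by-K matrix rather than on the raw point clouds. The linearized OT embedding mentioned in the abstract would replace the `clr` step and requires an OT solver, which this sketch does not include.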

Paper Structure (28 sections, 1 theorem, 20 equations, 8 figures, 2 tables)

Key Result

Proposition 1

When the $\mu^i$'s are absolutely continuous w.r.t. the Lebesgue measure, the Wasserstein barycenter problem with varying weights is equivalent to the $K$-means quantization of the mean measure, that is,

$$\min_{x_1, \dots, x_K} \sum_{k=1}^K \int_{V_{x_k}} \|x - x_k\|^2 \, d\overline{\mu}(x),$$

where $\overline{\mu} = \frac{1}{N} \sum_{i=1}^N \mu^i$ is the average measure, and $V_{x_k}$ is the Voronoi cell centered at point $x_k$.
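One way to read the proposition is through linearity: for fixed centroids, the best weights for each $\mu^i$ are the masses of the Voronoi cells, and the resulting cost is the mean squared distance to the nearest centroid, so averaging these costs over $i$ gives the K-means distortion of $\overline{\mu}$. The sketch below checks this numerically on empirical measures with equal sample sizes. This is an illustrative assumption, since the proposition itself concerns absolutely continuous measures, and the candidate centroids are arbitrary rather than optimized.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return the nearest-centroid label and squared distance for each point."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(len(points)), labels]

rng = np.random.default_rng(1)
n_samples, n_points, dim, k = 4, 300, 3, 6

# Empirical measures mu^1..mu^N, each uniform on n_points atoms (toy data).
clouds = [rng.normal(loc=i, scale=1.0, size=(n_points, dim))
          for i in range(n_samples)]
# Arbitrary candidate centroids x_1..x_K shared by all measures.
centroids = rng.normal(loc=1.5, scale=2.0, size=(k, dim))

per_sample_cost, per_sample_weights = [], []
for cloud in clouds:
    labels, d2 = nearest_centroid(cloud, centroids)
    per_sample_cost.append(d2.mean())                      # quantization cost of mu^i
    per_sample_weights.append(np.bincount(labels, minlength=k) / n_points)

# Mean measure: with equal sample sizes it is uniform on the merged cloud.
merged = np.vstack(clouds)
labels, d2 = nearest_centroid(merged, centroids)
merged_cost = d2.mean()
merged_weights = np.bincount(labels, minlength=k) / len(merged)

barycenter_objective = float(np.mean(per_sample_cost))
print(np.isclose(barycenter_objective, merged_cost))
print(np.allclose(np.mean(per_sample_weights, axis=0), merged_weights))
```

Because the two objectives agree for every choice of centroids, minimizing over the centroids gives the same solution either way, which is why a single K-means run on the merged data suffices in this setting.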

Figures (8)

  • Figure 1: HIPC dataset: silhouette score and execution time of our method as a function of the number of clusters. The red dotted line represents the execution time for the reference measure chosen as either uniform, random among the $\nu^{i}$'s or the concatenation of two measures $(\nu^{i},\nu^{j})$.
  • Figure 2: Two-dimensional representations of the HIPC data using PCA on different embeddings: (top left) KME, (top right) K-Means + comp, (bottom left) K-Means + LinW2, (bottom right) FlowSOM + LinW2. Each color corresponds to a patient. Each marker encodes the laboratory where the data was processed.
  • Figure 3: Comparison of execution times for the different embeddings.
  • Figure 4: Minimum Spanning Tree for the diagnosis and two follow-up measurements of one patient of the DATAML-Bordeaux dataset, and one normal bone marrow.
  • Figure 5: Two-dimensional representations of the DATAML-Bordeaux dataset using PCA on different embeddings: (top) K-Means + LinW2, (middle) K-Means + comp, (bottom) KME. Each dot represents an FCM sample. The colors encode the nature of the data: (red) diagnostic, (purple) normal bone marrow, (yellow) positive follow-up, (green) negative follow-up, (gray) no information on the MRD-BioM.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proof of Proposition 1