Low dimensional representation of multi-patient flow cytometry datasets using optimal transport for minimal residual disease detection in leukemia
Erell Gachon, Jérémie Bigot, Elsa Cazelles, Audrey Bidet, Jean-Philippe Vial, Pierre-Yves Dumas, Aguirre Mimoun
TL;DR
This work tackles minimal residual disease (MRD) detection in acute myeloid leukemia (AML) by recasting multi-patient flow cytometry datasets as high-dimensional probability measures and applying optimal transport (OT)-based dimensionality reduction. By combining mean-measure quantization (via K-means on the merged data) with embeddings in linear spaces through either Wasserstein PCA or log-ratio PCA, the authors obtain informative 2D representations that reveal intra- and inter-patient variability and correlate with MRD measures. Across two datasets (DATAML-Bordeaux and HIPC), the OT-based approach outperforms kernel mean embeddings and aligns well with FlowSOM MRD results, enabling effective clustering and supervised classification of MRD status. This OT-driven framework offers a scalable, interpretable tool for MRD assessment and could enhance relapse prognosis when used alongside FlowSOM in clinical workflows.
Abstract
Representing and quantifying Minimal Residual Disease (MRD) in Acute Myeloid Leukemia (AML), a type of cancer that affects the blood and bone marrow, is essential in the prognosis and follow-up of AML patients. As traditional cytological analysis cannot detect leukemia cells below 5%, the analysis of flow cytometry datasets is expected to provide more reliable results. In this paper, we explore statistical learning methods based on optimal transport (OT) to achieve a relevant low-dimensional representation of multi-patient flow cytometry measurement (FCM) datasets considered as high-dimensional probability distributions. Using the framework of OT, we justify the use of the K-means algorithm for dimensionality reduction of multiple large-scale point clouds through mean measure quantization by merging all the data into a single point cloud. After this quantization step, the visualization of the intra- and inter-patient FCM variability is carried out by embedding low-dimensional quantized probability measures into a linear space using either Wasserstein Principal Component Analysis (PCA) through linearized OT or log-ratio PCA of compositional data. Using a publicly available FCM dataset and an FCM dataset from Bordeaux University Hospital, we demonstrate the benefits of our approach over the popular kernel mean embedding technique for statistical learning from multiple high-dimensional probability distributions. We also highlight the usefulness of our methodology for low-dimensional projection and clustering of patient measurements according to their level of MRD in AML from FCM. In particular, our OT-based approach allows a relevant and informative two-dimensional representation of the results of the FlowSOM algorithm, a state-of-the-art method for the detection of MRD in AML using multi-patient FCM.
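The quantization and embedding steps described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes synthetic Gaussian point clouds in place of real FCM data, a plain Lloyd-style K-means written in NumPy, and the centered log-ratio (CLR) transform as one common way to carry out log-ratio PCA. The function names (`kmeans`, `quantize_patients`, `log_ratio_pca`), the pseudo-count, and all parameter values are hypothetical choices made for illustration. The Wasserstein PCA alternative is not shown.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def quantize_patients(clouds, k):
    """Merge all patient clouds, run K-means once, and return the shared
    centroids plus one weight vector per patient (fraction of that
    patient's cells assigned to each centroid)."""
    merged = np.vstack(clouds)
    centroids, labels = kmeans(merged, k)
    weights = np.zeros((len(clouds), k))
    start = 0
    for i, cloud in enumerate(clouds):
        patient_labels = labels[start:start + len(cloud)]
        weights[i] = np.bincount(patient_labels, minlength=k) / len(cloud)
        start += len(cloud)
    return centroids, weights


def log_ratio_pca(weights, n_components=2, pseudo=1e-6):
    """PCA of CLR-transformed weights. The pseudo-count (an assumption)
    avoids log(0) when a patient has no cells in some cluster."""
    w = weights + pseudo
    w = w / w.sum(axis=1, keepdims=True)
    log_w = np.log(w)
    clr = log_w - log_w.mean(axis=1, keepdims=True)
    centered = clr - clr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in for multi-patient data: 6 "patients", 4 markers each.
# Three patients contain a 20% shifted sub-population, three contain none.
rng = np.random.default_rng(0)
clouds = []
for frac in [0.0, 0.0, 0.0, 0.2, 0.2, 0.2]:
    n_cells = 500
    n_shifted = int(frac * n_cells)
    background = rng.normal(0.0, 1.0, size=(n_cells - n_shifted, 4))
    shifted = rng.normal(6.0, 1.0, size=(n_shifted, 4))
    clouds.append(np.vstack([background, shifted]))

centroids, weights = quantize_patients(clouds, k=8)
embedding = log_ratio_pca(weights)  # one 2D point per patient
print(embedding)
```

Each patient is reduced to a vector of `k` cluster weights on a common set of centroids, so patients become comparable as compositional data. The 2D `embedding` can then be plotted or passed to a clustering or classification method.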
