MMIL: A novel algorithm for disease associated cell type discovery

Erin Craig; Timothy Keyes; Jolanda Sarno; Maxim Zaslavsky; Garry Nolan; Kara Davis; Trevor Hastie; Robert Tibshirani

TL;DR

The paper addresses the challenge of identifying disease-associated cell populations when only patient-level labels are available. It presents MMIL, an Expectation-Maximization–based framework that jointly estimates latent cell labels and trains cell-level classifiers, and it supports calibration of predicted probabilities and semi-supervised learning. Through applications to AML and ALL mass cytometry datasets, MMIL demonstrates accurate cancer-cell identification, robust generalization across patients, tissues, and treatment timepoints, and biologically meaningful feature selection. These results suggest MMIL offers a robust, calibration-friendly tool for cell-level disease discovery, diagnostics, and monitoring in contexts with unknown gold-standard cell labels and high-dimensional data.

Abstract

Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation-maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train a range of classifiers, including lasso logistic regression models, gradient-boosted trees, and neural networks. When applied to clinically annotated primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL can incorporate cell labels into model training when they are known, providing a framework for leveraging labeled and unlabeled data simultaneously. MMIL offers a novel approach to cell classification, with significant potential to advance disease understanding and management, especially in settings where gold-standard labels are unavailable and the data are high-dimensional.
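To make the expectation-maximization idea concrete, the following is a minimal sketch under strong simplifying assumptions, not the authors' implementation. It swaps the cell-level classifier for a one-dimensional Gaussian mixture so that both EM steps have closed forms: cells from healthy patients are fixed as healthy, while each cell from a sick patient carries a latent label whose probability is re-estimated at every iteration. The function names (`normal_pdf`, `weighted_mean_sd`, `fit_mixture_mil`), the single simulated marker, and all simulation parameters are hypothetical. The initialization at 0.5 follows the description in the Figure 1 caption.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation, floored to avoid collapse."""
    total = max(sum(weights), 1e-12)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, max(math.sqrt(var), 1e-3)


def fit_mixture_mil(healthy_cells, sick_cells, n_iter=100):
    """EM for one marker when only patient-level labels are known.

    Hypothetical helper (an assumption for illustration): Gaussian
    class-conditional densities stand in for the paper's classifier.
    Returns (prevalence, (mean0, sd0), (mean1, sd1), responsibilities).
    """
    # Every cell from a sick patient starts with probability 0.5 of disease.
    resp = [0.5] * len(sick_cells)
    pooled = list(healthy_cells) + list(sick_cells)
    for _ in range(n_iter):
        # M-step: update prevalence and both components from soft labels.
        prevalence = sum(resp) / len(resp)
        mean1, sd1 = weighted_mean_sd(sick_cells, resp)
        # Healthy-patient cells count fully toward the healthy component.
        weights0 = [1.0] * len(healthy_cells) + [1.0 - r for r in resp]
        mean0, sd0 = weighted_mean_sd(pooled, weights0)
        # E-step: posterior probability of disease for sick-patient cells.
        new_resp = []
        for x in sick_cells:
            p1 = prevalence * normal_pdf(x, mean1, sd1)
            p0 = (1.0 - prevalence) * normal_pdf(x, mean0, sd0)
            new_resp.append(p1 / (p1 + p0) if p1 + p0 > 0.0 else 0.5)
        resp = new_resp
    return prevalence, (mean0, sd0), (mean1, sd1), resp


# Simulated data (all values are illustrative assumptions).
rng = random.Random(0)
healthy = [rng.gauss(0.0, 1.0) for _ in range(500)]
truth = [rng.random() < 0.4 for _ in range(500)]  # hidden cell labels
sick = [rng.gauss(4.0, 1.0) if d else rng.gauss(0.0, 1.0) for d in truth]

prevalence, comp0, comp1, resp = fit_mixture_mil(healthy, sick)
accuracy = sum((r > 0.5) == d for r, d in zip(resp, truth)) / len(truth)
print(round(prevalence, 2), round(accuracy, 2))
```

In this toy setting the healthy-patient cells anchor the healthy component, which is what breaks the symmetry of the 0.5 initialization. A closer analogue of the paper's approach would replace the Gaussian densities with a classifier refit on the soft labels at each iteration; that variant is not shown here.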

Paper Structure (33 sections, 18 equations, 10 figures, 1 table, 2 algorithms)

Introduction
Results
An algorithm to classify cells without complete cell labels
Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using only unlabeled cells
Mixture modeling identifies cancer cells in Acute Myeloid Leukemia using both labeled and unlabeled cells
Mixture modeling identifies and tracks cancer cells throughout treatment progression in Acute Lymphoblastic Leukemia
Discussion
Methods
Related work
Simulation reveals that calibration is possible without ground truth labels
Algorithm details
AML dataset analysis
Data acquisition
Data cleaning and preprocessing
Model fitting
...and 18 more sections

Figures (10)

Figure 1: Mixture Modeling for Multiple Instance Learning (MMIL) detects cancer cells in Acute Myeloid Leukemia (AML) using patient labels only. (A) Process to train a mixture model for multiple instance learning data. We initialize the sick person's cell labels as $0.5$: in this example, we assume that the prevalence of diseased cells in sick people is $50\%$, and so each cell has a 50/50 chance of being diseased. After training the first classifier, we improve our estimates of the sick person's cell labels. This process is repeated until convergence. (B) Schematic of model training and evaluation on the AML cohort from tsai2020multiplexed. (C) Nonzero coefficients for the Mixture Lasso model trained to detect leukemic blasts in AML. (D) Nonzero coefficients for the "Optimal" Lasso model trained to detect leukemic blasts in AML. (E) Receiver operating characteristic (ROC) curves demonstrating individual (thin) and average (thick) performance of Optimal, Mixture, and Naive lasso models trained to detect leukemic blasts in AML. Insets indicate mean area under the ROC curve (AUC) across all patients. (F) Scatterplots representing the relationship between the gold-standard, pathologist-enumerated blast percentage for each patient (X-axis) and the model-assigned blast percentage for each patient (Y-axis) for the Optimal (red), Mixture (blue), and Naive (yellow) lasso models. Inset text represents the Pearson correlation coefficients between the values on the X- and Y-axes.
Figure 2: MMIL identifies regions of high-dimensional phenotype space occupied by cells from AML patients, but not by cells from healthy controls. (A) A scatterplot of uniform manifold approximation and projection (UMAP) coordinates colored by Mixture Lasso probabilities. Cells with probability scores close to 0 have a very small chance of being AML-associated (i.e. leukemic blasts), whereas cells with probability scores close to 1 have a high chance of being AML-associated (leukemic blasts). (B) UMAP plot as in (A), but with cells annotated as leukemic blasts by a pathologist in red and cells annotated as healthy (i.e. not leukemic blasts) by a pathologist in blue. Note the general agreement of probabilities in (A) with red regions in (B). (C) UMAP plot as in (A), but with cells collected from cancer patients shown in orange and cells collected from healthy controls in blue. Note that regions with overlapping orange and blue cells are assigned low Mixture Lasso probabilities in (A). (D) A UMAP plot of local "phenotypic neighborhoods" spaced throughout the high-dimensional point cloud selected by density-dependent downsampling (Methods). Neighborhoods are colored based on the proportion of cells that they contain that come from cancer patients. Note that neighborhoods exclusively comprised of cells from cancer patients are assigned high Mixture Lasso probabilities in (A). (E) A UMAP plot as in (A), but with cells colored by their expression of Lactoferrin, the marker with the largest coefficient in the Mixture Lasso model. See also Supplementary Figure 2. (F) Count heatmap of 2-dimensional bins demonstrating the correlation between the average Mixture Lasso probability in a phenotypic neighborhood (Y-axis) and the proportion of cells from cancer patients it contains (X-axis). Bins are colored by the density of neighborhoods in that region of the plot, and the red line represents the locally-weighted moving average across the X-axis. Inset text indicates the Pearson correlation between the values on the X- and Y-axes. Note: In panels A-E, UMAP coordinates were calculated using all protein markers.
Figure 3: MMIL can train on labeled and unlabeled data simultaneously to incorporate expert knowledge while remaining robust to imperfect labeling. (A) Schematic of semi-supervised "0-shot" and "1-shot" MMIL experiments (also see Methods). (B) Boxplots indicating average area under the receiver operating characteristic curve (AUROC) for Mixture Lasso (blue), Naive (orange), and Optimal models across 0-shot (left), 1-shot (with perfect labels; middle), and 1-shot (with imperfect labels; right) training procedures. "***" represents statistical significance at the level of $p<0.0001$ using a paired Student's t-test with Benjamini-Hochberg correction for multiple comparisons. (C) Positive coefficients for an Optimal Lasso model fit on a single patient (AML-5An). (D) Positive Mixture Lasso coefficients after 0-shot (left) and 1-shot (right) learning. Note that rRNA, the feature with the largest Optimal Lasso coefficient in (C) and Figure 1D, was selected with a positive coefficient only after 1-shot learning. (E) Two-sided barplot indicating how many times a feature was included in the Mixture Lasso model with a positive coefficient after 0-shot (left, blue) and 1-shot (right, orange) training. Dashed gray lines indicate the maximum number of times a feature could have been included (13, the total number of 1-shot experiments).
Figure 4: Mixture Lasso identifies leukemia cells despite being trained without any cell labels. We compare the performance of Mixture Lasso to the optimal model (trained using gold standard labels that are typically unavailable) and the naive model (trained using patient labels in place of cell labels). Mixture Lasso generalizes better than the naive approach from bone marrow to blood samples and across time, evidenced by its performance on blood samples collected at (B) diagnosis, (C) day 8 post-treatment initiation, and (D) day 15 post-treatment initiation.
Figure 5: Mixture Lasso selects features that discriminate between healthy cells and cancer cells (leukemic blasts) in Acute Myeloid Leukemia (AML). Density plots indicating that 5 markers (Lactoferrin, CD16, CD56, Lamin A/C, and CD33) selected by Mixture Lasso trained using only patient labels successfully separate healthy and cancer cell populations. Healthy patient IDs are the top 3 density plots in each panel (REACTIVE-3A, REACTIVE-2A, and REACTIVE-3A), and all other plots are AML patients. Note that rRNA, which strongly separates healthy and cancer cells, was not selected by Mixture Lasso, representing an interesting instance of a missed discovery.
...and 5 more figures
