Mixed Models with Multiple Instance Learning

Jan P. Engelmann, Alessandro Palma, Jakub M. Tomczak, Fabian J. Theis, Francesco Paolo Casale

TL;DR

This work introduces MixMIL, a framework that integrates Generalized Linear Mixed Models (GLMM) with Multiple Instance Learning (MIL), retaining the advantages of linear models while modeling cell state heterogeneity, and shows that MixMIL outperforms existing MIL models on single-cell datasets.

Abstract

Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.
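The contrast drawn in the abstract, between averaging cells and modeling cell heterogeneity, can be illustrated with a minimal sketch. It assumes a softmax-weighted linear pooling of instance embeddings, which is one common MIL aggregation and not necessarily the authors' exact formulation. The names `mean_pool`, `weighted_pool_score`, `beta`, and `gamma` are hypothetical and chosen only for this illustration; the toy bag is made-up data.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(bag):
    """Average the instance embeddings of one bag (a list of equal-length lists)."""
    n = len(bag)
    dim = len(bag[0])
    return [sum(x[d] for x in bag) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool_score(bag, beta, gamma):
    """Bag score = sum_i w_i * (x_i . beta), with w = softmax(x_i . gamma).

    Assumption: beta holds hypothetical instance effects and gamma holds
    hypothetical instance-importance parameters.
    """
    weights = softmax([dot(x, gamma) for x in bag])
    return sum(w * dot(x, beta) for w, x in zip(weights, bag))


# Toy bag: one cell in state A, two cells in state B (made-up embeddings).
bag = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
beta = [1.0, -1.0]

# With gamma = 0 every cell gets the same weight, so the score matches
# a linear model applied to the mean embedding.
mean_score = dot(mean_pool(bag), beta)
uniform_score = weighted_pool_score(bag, beta, [0.0, 0.0])

# A non-zero gamma up-weights state-A cells and changes the bag score.
focused_score = weighted_pool_score(bag, beta, [5.0, 0.0])
```

When the importance parameters are zero, the weighted score reduces to the mean-pooling linear model, which is why this kind of aggregation can be read as an extension of the linear baseline rather than a replacement for it.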

Paper Structure (66 sections, 14 equations, 8 figures, 16 tables)

Figures (8)

  • Figure 1: (a) MixMIL uses predefined instance embeddings from domain-specific unsupervised models for robustness and efficiency. (b) Generalized multi-instance mixed model framework defining MixMIL.
  • Figure 2: (a-c) Out-of-sample prediction accuracy (Spearman correlation, $\rho$) for MixMIL, GLMM, and baseline MILs (ABMIL, Gated ABMIL, DSMIL, and Bayes-MIL) varying the sample size (a), the amount of instance importance heterogeneity (b) and the number of instances (c). (d-f) Instance retrieval ROC-AUC of MixMIL and baseline MILs for the top 10% of instances in the same simulated scenarios. GLMM is not shown as it is not designed for instance retrieval. Stars denote default values that were kept constant while varying other parameters. Error bars denote standard errors across 10 repeat experiments. Full results across all methods and scenarios can be found in Section \ref{sec:simulation_appendix}.
  • Figure 3: Scatter plots comparing the prediction performance (Spearman correlation, $\rho$) of MixMIL (y-axis) against baseline MILs (x-axis) for 28 genetic labels: MixMIL vs ABMIL (a), MixMIL vs DSMIL (b), MixMIL vs Bayes-MIL (c), MixMIL vs Gated ABMIL (d). Genetic labels for which MixMIL yielded improved prediction accuracy are highlighted in red. The count of these genes and the P-values from a binomial test (assuming a null of 50/50 performance chance over 28 trials) are reported for each comparison.
  • Figure 4: Top and bottom 16 weighted cells for the Latrunculin B drug for different MIL methods.
  • Figure E.1: Out-of-sample prediction performance (Spearman correlation, $\rho$) in different simulation settings, varying one parameter at a time while keeping the others constant. Specifically, we varied the number of bags (a), the number of instances (b), the number of features (c), the heterogeneity effect (d) and the variance of signal (e).
  • ...and 3 more figures
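
The binomial test mentioned in the Figure 3 caption (a null of 50/50 performance chance over 28 trials) can be sketched as follows. This is an illustration only: the caption does not say whether the test is one-sided or two-sided, so a one-sided upper tail is assumed here, and `hypothetical_wins` is a made-up count, not a result from the paper.

```python
from math import comb


def binomial_tail_p(successes, trials, p=0.5):
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_labels = 28           # number of genetic labels, from the caption
hypothetical_wins = 20  # made-up count of labels where one method wins

# One-sided P-value under the null that each method wins with probability 0.5.
p_one_sided = binomial_tail_p(hypothetical_wins, n_labels)
```

Under this null, a win count near 14 of 28 gives a large P-value, while counts far above 14 give a small one.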