Conditional Similarity Triplets Enable Covariate-Informed Representations of Single-Cell Data

Chi-Jane Chen, Haidong Yi, Natalie Stanley

TL;DR

CytoCoSet is a set-based encoding method whose loss function adds a triplet term that penalizes disparate per-sample embeddings for samples with similar covariates, which improves prediction of clinical phenotypes.

Abstract

Single-cell technologies enable comprehensive profiling of diverse immune cell types through the measurement of multiple genes or proteins per individual cell. To translate immune signatures assayed from blood or tissue into powerful diagnostics, machine learning approaches are often employed to compute immunological summaries or per-sample featurizations, which can be used as inputs to models for outcomes of interest. Current supervised learning approaches for computing per-sample representations are trained only to accurately predict a single outcome and do not take into account relevant additional clinical features or covariates that are likely to also be measured for each sample. Here, we introduce a novel approach that incorporates measured covariates when optimizing model parameters, so that the resulting per-sample encodings accurately reflect both immune signatures and additional clinical information. Our method, CytoCoSet, is a set-based encoding method for learning per-sample featurizations. It formulates a loss function with an additional triplet term that penalizes disparate per-sample embeddings for samples with similar covariates. Overall, incorporating clinical covariates enables the learning of encodings for each individual sample that ultimately improve prediction of clinical outcome.
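
To make the structure of such a loss concrete, the following is a minimal NumPy sketch of a binary cross-entropy term combined with a triplet term. It is an illustration under stated assumptions, not the authors' implementation: the hinge form of the triplet term, the `margin` parameter, the Euclidean distance, and the way `alpha` weights the triplet term are all assumed here, and the function names are hypothetical.

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

def triplet_loss(emb, triplets, margin=1.0):
    """Hinge-style triplet penalty (assumed form).

    emb:      (n_samples, dim) array of per-sample embeddings.
    triplets: list of (anchor, similar, dissimilar) sample indices, where
              'similar' shares covariates with the anchor and 'dissimilar'
              does not.
    """
    if len(triplets) == 0:
        return 0.0
    a, p, n = np.asarray(triplets).T
    d_pos = np.linalg.norm(emb[a] - emb[p], axis=1)  # anchor vs. similar
    d_neg = np.linalg.norm(emb[a] - emb[n], axis=1)  # anchor vs. dissimilar
    # Penalize triplets where the similar sample is not closer by `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def combined_loss(y_true, y_prob, emb, triplets, alpha=0.5, margin=1.0):
    """Prediction term plus alpha-weighted triplet term (assumed weighting)."""
    return bce_loss(y_true, y_prob) + alpha * triplet_loss(emb, triplets, margin)

# Toy example: samples 0 and 1 have similar covariates, sample 2 differs.
emb = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]])
y_true = np.array([1, 0, 1])
y_prob = np.array([0.9, 0.1, 0.8])
satisfied = triplet_loss(emb, [(0, 1, 2)])  # similar pair is already close
violated = triplet_loss(emb, [(0, 2, 1)])   # "similar" pair is far apart
total = combined_loss(y_true, y_prob, emb, [(0, 1, 2)], alpha=0.5)
```

In this toy example the first triplet incurs no penalty because the covariate-similar pair is already much closer than the dissimilar pair, while the second triplet is penalized.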

Paper Structure (16 sections, 7 equations, 8 figures)

Figures (8)

  • Figure 1: Overview. Schematic overview of CytoCoSet. (a) Given a multi-sample single-cell dataset with additional covariates measured in each sample, (b) the CytoCoSet algorithm defines a set of triplets based on Random Fourier Features (RFFs) to constrain the process of learning per-sample embedding vectors. A triplet is a combination of three samples in which two samples have similar covariates and should have similar embeddings, while the third sample has distinct covariates and should therefore have a more divergent embedding. (c) The loss function used to optimize the embeddings comprises a binary cross-entropy term to enforce prediction accuracy and a triplet term that encodes covariate-based similarity constraints. (d) Embedding vectors learned by the model can be used to train machine learning models of clinical outcome.
  • Figure 2: Overview of how Random Fourier Features are used to select triplets. An illustration of how Random Fourier Features (RFFs) are used to summarize the overall immune landscape for each sample. (a) The columns of a cell $\times$ feature input matrix ${A}_i$ are transformed with $\frac{d}{2}$ Gaussian random variables ($P$) to produce a new matrix, ${A'}_i$. (b) Per-cell RFFs are constructed by concatenating sine- and cosine-transformed features for each cell (we use $\oplus$ to denote concatenation). Sine- and cosine-transformed values of ${A'}_i$ are used to form a matrix $Z_i$ across all cells. Finally, a Random Fourier Feature vector $S_{i}$ is constructed by pooling (median or max) the feature values in each dimension across all cells.
  • Figure 3: Classification AUC in CyTOF Datasets. (a) CytoCoSet and baseline methods were assessed for their effectiveness in generating per-sample encodings that predict binary clinical outcomes across three CyTOF datasets (Preeclampsia, Preterm, Lung Cancer). Barplots reflect the mean AUC obtained across 30 unique train/test splits (using mean as the pooling operation to select triplets with RFFs). Error bars reflect 95% confidence intervals around the mean. Pairs of methods with statistically significant ($p<0.05$) differences in accuracy are indicated. (b) We evaluated CytoCoSet under various choices of pooling operation in the RFF step. Results show the mean AUCs obtained with various combinations of covariate and pooling operation, denoted in labels as 'covariate-pooling operation'. In each dataset, the covariate-pooling strategy leading to the highest classification accuracy under CytoCoSet is shown as a bar in a color other than navy blue.
  • Figure 4: Quantifying Alignment of Embedding Vectors with Covariates. Boxplots visualize embedding distances computed between pairs of samples with the same age (denoted as 'Same') and between pairs with different ages (denoted as 'Diff') under the CytoSet (yellow) and CytoCoSet (green) approaches. The green triangle in each boxplot represents the mean embedding distance.
  • Figure 5: Sensitivity Analysis of Parameters within the Loss Function. We systematically varied the three model hyperparameters, $\alpha$, 'same threshold' ($H_{s}$), and 'diff threshold' ($H_{d}$), and visualized the mean AUC/standard deviation over ten trials with different train/test splits. Yellow squares denote the optimal hyperparameter combinations across the ten trials. Each heatmap cell also shows the number of trials in which that parameter combination was optimal.
  • ...and 3 more figures
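
The Random Fourier Feature summary described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration assuming a standard-normal projection matrix `P` that is shared across samples so the summaries are comparable; the function name `rff_sample_summary`, the absence of any bandwidth or normalization constant, and the toy data are assumptions, not details taken from the paper.

```python
import numpy as np

def rff_sample_summary(A, P, pooling="median"):
    """Summarize one sample's cells as a single RFF vector.

    A: (n_cells, n_features) cell-by-feature matrix for one sample.
    P: (n_features, d/2) matrix of Gaussian random variables.
    Returns a length-d vector S pooled across cells.
    """
    A_prime = A @ P                                   # (n_cells, d/2)
    # Concatenate sine- and cosine-transformed features for each cell.
    Z = np.concatenate([np.sin(A_prime), np.cos(A_prime)], axis=1)
    if pooling == "median":
        return np.median(Z, axis=0)
    if pooling == "max":
        return Z.max(axis=0)
    if pooling == "mean":
        return Z.mean(axis=0)
    raise ValueError(f"unknown pooling operation: {pooling!r}")

# Toy data: two samples with different cell counts but the same features.
rng = np.random.default_rng(0)
n_features, d = 5, 8
P = rng.normal(size=(n_features, d // 2))   # shared across samples (assumed)
sample_a = rng.normal(size=(100, n_features))
sample_b = rng.normal(size=(250, n_features))

S_a = rff_sample_summary(sample_a, P, pooling="median")
S_b = rff_sample_summary(sample_b, P, pooling="max")
```

Because pooling collapses the cell axis, samples with different numbers of cells map to vectors of the same length `d`, which is what allows them to be compared when selecting triplets.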