SpaRED benchmark: Enhancing Gene Expression Prediction from Histology Images with Spatial Transcriptomics Completion

Gabriel Mejia, Daniela Ruiz, Paula Cárdenas, Leonardo Manrique, Daniela Vega, Pablo Arbeláez

TL;DR

A systematically curated and processed database collected from 26 public sources is presented, representing an 8.6-fold increase compared to previous works. A state-of-the-art transformer-based completion technique for inferring missing gene expression is also proposed; it significantly boosts the performance of transcriptomic profile predictions across all datasets.

Abstract

Spatial Transcriptomics is a novel technology that aligns histology images with spatially resolved gene expression profiles. Although groundbreaking, it struggles with gene capture, yielding high corruption in the acquired data. Given its potential applications, recent efforts have focused on predicting transcriptomic profiles solely from histology images. However, differences in databases, preprocessing techniques, and training hyperparameters hinder a fair comparison between methods. To address these challenges, we present a systematically curated and processed database collected from 26 public sources, representing an 8.6-fold increase compared to previous works. Additionally, we propose a state-of-the-art transformer-based completion technique for inferring missing gene expression, which significantly boosts the performance of transcriptomic profile predictions across all datasets. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on spatial transcriptomics.

Paper Structure (16 sections, 3 equations, 4 figures)

Figures (4)

  • Figure 1: (a) Organisms and tissues available in SpaRED, along with the number of spots available from each tissue. (b) Prediction Pearson Correlation Coefficient for each model across all the datasets in SpaRED. For each dataset, the state-of-the-art model that obtains the highest Pearson Correlation Coefficient is included.
  • Figure 2: Overview of our data completion framework using a transformer-based model.
  • Figure 3: Completion results: Violin plot displaying completion MSE scores for each method (SpaCKLE, Median and stLearn) across all datasets in SpaRED (upper left). Line plot displaying completion MSE for the median and SpaCKLE methods across different percentages of synthetically masked data (middle left). Qualitative results showing gene completion for increasing synthetic masking percentages (row 1) with the median method (row 2) and SpaCKLE (row 3).
  • Figure 4: (a) Violin plot: normalized prediction MSE of each model across all datasets within SpaRED, with normalization done against the best MSE obtained on each dataset. The mean and standard deviation of the methods are included at the top of each violin. Pie chart: percentage of datasets within SpaRED for which each model achieves the best prediction MSE. (b) Mean normalized prediction MSE against the number of trainable parameters for each model.
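
The normalization described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that "normalization against the best MSE" means dividing each model's MSE by the lowest MSE obtained on that dataset; the caption does not state the exact formula, so this is an interpretation. The function names, dataset names, model names, and numbers below are hypothetical placeholders, not results from the paper.

```python
# Hypothetical sketch: per-dataset MSE normalization and "share of wins".
# Assumption: normalized MSE = model MSE / best (lowest) MSE on that dataset.

def normalize_by_best(mse):
    """Map {dataset: {model: mse}} to the same shape with each value
    divided by the lowest MSE reached on that dataset."""
    normalized = {}
    for dataset, scores in mse.items():
        best = min(scores.values())
        normalized[dataset] = {m: v / best for m, v in scores.items()}
    return normalized


def mean_normalized(normalized):
    """Average each model's normalized MSE over all datasets."""
    totals, counts = {}, {}
    for scores in normalized.values():
        for model, value in scores.items():
            totals[model] = totals.get(model, 0.0) + value
            counts[model] = counts.get(model, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}


def best_model_share(mse):
    """Fraction of datasets on which each model has the lowest MSE
    (the quantity a pie chart of winners would display)."""
    wins = {}
    for scores in mse.values():
        winner = min(scores, key=scores.get)
        wins[winner] = wins.get(winner, 0) + 1
    return {m: c / len(mse) for m, c in wins.items()}


# Placeholder values chosen only to make the arithmetic easy to follow.
example_mse = {
    "dataset_a": {"model_x": 2.0, "model_y": 4.0},
    "dataset_b": {"model_x": 3.0, "model_y": 1.5},
}
example_normalized = normalize_by_best(example_mse)
example_mean = mean_normalized(example_normalized)
example_share = best_model_share(example_mse)
```

Under this assumption, the best model on a dataset always scores exactly 1.0 there, so a mean normalized MSE close to 1.0 indicates a model that is near the best on most datasets.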