MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models

Minsoo Lee; Jonghyun Kim; Juseung Yun; Sunwoo Yu; Jongseong Jang

TL;DR

MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers, achieves the best overall performance on both HEST-Bench for gene expression prediction and general pathology tasks, demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.

Abstract

Pathology foundation models learn morphological representations through self-supervised pretraining on large-scale whole-slide images, yet they do not explicitly capture the underlying molecular state of the tissue. Spatial transcriptomics technologies bridge this gap by measuring gene expression in situ, offering a natural cross-modal supervisory signal. We propose MINT (Molecularly Informed Training), a fine-tuning framework that incorporates spatial transcriptomics supervision into pretrained pathology Vision Transformers. MINT appends a learnable ST token to the ViT input to encode transcriptomic information separately from the morphological CLS token, and it prevents catastrophic forgetting through DINO self-distillation and explicit feature anchoring to the frozen pretrained encoder. Gene expression regression at both spot-level (Visium) and patch-level (Xenium) resolutions provides complementary supervision across spatial scales. Trained on 577 publicly available HEST samples, MINT achieves the best overall performance on both HEST-Bench for gene expression prediction (mean Pearson r = 0.440) and EVA for general pathology tasks (0.803), demonstrating that spatial transcriptomics supervision complements morphology-centric self-supervised pretraining.
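The abstract reports gene expression prediction quality as a mean Pearson r. As a rough illustration of what such a number summarizes, the sketch below computes Pearson r per gene across spots and averages over genes. This is a minimal sketch under stated assumptions: the function names (`pearson_r`, `mean_pearson_over_genes`) and the rule of skipping constant genes are hypothetical, and the exact HEST-Bench protocol is not described in this excerpt.

```python
import math
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson_over_genes(
    pred: List[List[float]], target: List[List[float]]
) -> float:
    """Average per-gene Pearson r.

    pred and target are indexed as [spot][gene]. Genes with undefined
    correlation are skipped (an assumption made for this sketch).
    """
    n_genes = len(pred[0])
    scores = []
    for g in range(n_genes):
        r = pearson_r([row[g] for row in pred], [row[g] for row in target])
        if not math.isnan(r):
            scores.append(r)
    if not scores:
        raise ValueError("no gene with a defined correlation")
    return sum(scores) / len(scores)
```

Averaging per gene, rather than pooling all values into one correlation, keeps highly expressed genes from dominating the score.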

Paper Structure (16 sections, 5 equations, 1 figure, 3 tables)

Introduction
Method
Overview
ST Token Design
Training Objectives
DINO Self-Distillation
Feature Distillation
Spot-Level ST Regression
Patch-Level Xenium Regression
Total Objective
Experiments
Training Setup
Benchmarks and Protocols
Comparison with Existing Models
Representation Analysis
...and 1 more section

Figures (1)

Figure 1: MINT framework. The student ViT, augmented with a learnable ST token, outputs CLS, ST, and patch token representations. Gene expression regression from the ST token ($\mathcal{L}_{\text{ST}}$) and patch tokens ($\mathcal{L}_{\text{pST}}$) provides transcriptomic supervision, while DINO self-distillation ($\mathcal{L}_{\text{DINO}}$) and feature anchoring ($\mathcal{L}_{\text{distill}}$) preserve morphological representations. Only the student is updated via backpropagation.
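The token layout and loss combination described in the caption can be sketched as follows. This is a minimal, framework-free sketch under stated assumptions: the token ordering `[CLS, ST, patch_1, ..., patch_N]`, the helper names, and the loss weights `w_distill`, `w_st`, and `w_pst` are all hypothetical, since the excerpt names the four loss terms but does not give their weighting or the exact position of the ST token.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class TokenOutputs:
    cls: Vector            # morphological summary token
    st: Vector             # transcriptomic summary token
    patches: List[Vector]  # one embedding per image patch


def build_input_tokens(
    cls_token: Sequence[float],
    st_token: Sequence[float],
    patch_tokens: Sequence[Sequence[float]],
) -> List[Vector]:
    """Assemble the sequence in the assumed order [CLS, ST, patches...]."""
    return [list(cls_token), list(st_token)] + [list(p) for p in patch_tokens]


def split_outputs(tokens: List[Vector]) -> TokenOutputs:
    """Split encoder outputs back into CLS, ST, and patch representations."""
    if len(tokens) < 2:
        raise ValueError("expected at least a CLS and an ST token")
    return TokenOutputs(cls=tokens[0], st=tokens[1], patches=tokens[2:])


def total_loss(
    l_dino: float,
    l_distill: float,
    l_st: float,
    l_pst: float,
    w_distill: float = 1.0,  # hypothetical weight
    w_st: float = 1.0,       # hypothetical weight
    w_pst: float = 1.0,      # hypothetical weight
) -> float:
    """Weighted sum of the four loss terms named in the caption."""
    return l_dino + w_distill * l_distill + w_st * l_st + w_pst * l_pst
```

In this reading, the ST output would feed the spot-level regression head and the patch outputs the patch-level head, while the CLS output stays tied to the morphology-preserving terms.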
