Vim4Path: Self-Supervised Vision Mamba for Histopathology Images

Ali Nasiri-Sarvi, Vincent Quoc-Huy Trinh, Hassan Rivaz, Mahdi S. Hosseini

TL;DR

The paper tackles robust representation learning for gigapixel histopathology WSIs under weak supervision by introducing Vision Mamba (Vim) as a self-supervised encoder within the DINO framework. Through experiments on Camelyon16, Vim demonstrates superior performance to ViT at small model scales and remains competitive as model size grows, with notable improvements in slide-level AUC and competitive patch-level results. Explainability analyses using Grad-CAM indicate that Vim emphasizes pathologist-relevant features (e.g., intracellular mucin and adjacent lymphocytes), suggesting alignment with clinical diagnostic workflows. The work advances SSL and MIL in computational pathology by combining a state-space–inspired encoder with self-distillation, and it provides code and pretrained weights to support further research and potential clinical deployment.

Abstract

Representation learning from gigapixel Whole Slide Images (WSIs) poses a significant challenge in computational pathology due to the complexity of tissue structures and the scarcity of labeled data. Multi-instance learning (MIL) methods address this challenge by classifying slides from image patches, using feature encoders pretrained with Self-Supervised Learning (SSL). The performance of both SSL and MIL methods relies on the architecture of the feature encoder. This paper proposes leveraging the Vision Mamba (Vim) architecture, inspired by state space models, within the DINO framework for representation learning in computational pathology. We evaluate the performance of Vim against Vision Transformers (ViT) on the Camelyon16 dataset for both patch-level and slide-level classification. Our findings highlight Vim's enhanced performance compared to ViT, particularly at smaller scales, where Vim achieves an 8.21 increase in ROC AUC for models of similar size. An explainability analysis further reveals that Vim, unlike ViT, emulates the pathologist workflow. This alignment with human expert analysis highlights Vim's potential in practical diagnostic settings and contributes to developing effective representation-learning algorithms in computational pathology. We release the code and pretrained weights at https://github.com/AtlasAnalyticsLab/Vim4Path.
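
The abstract places the encoder inside DINO's self-distillation setup. As a rough illustration of that objective (not the released training code), the sketch below computes the cross-entropy between a centered, sharpened teacher distribution and a student distribution for one pair of views. The function names, the three-dimensional logits, and the temperature and momentum values are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student) for one pair of views.

    The teacher output is centered and sharpened with a low temperature;
    in training, no gradient would flow through the teacher branch.
    Temperatures here are illustrative assumptions.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def update_center(center, teacher_batch, momentum=0.9):
    """Exponential moving average of teacher outputs, used for centering."""
    batch_mean = [sum(col) / len(teacher_batch) for col in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]


if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    teacher = [2.0, 0.5, -1.0]
    print("views agree:   ", dino_loss([2.0, 0.5, -1.0], teacher, center))
    print("views disagree:", dino_loss([-1.0, 0.5, 2.0], teacher, center))
```

The loss is small when the student ranks the output dimensions the way the teacher does and grows when it does not, which is the signal that drives the encoder (Vim or ViT) during pretraining.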

Paper Structure (19 sections, 3 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Detailed architecture of Vim within the DINO framework. We modify the Vim model so that its positional embeddings are interpolated to match the input image size, and employ the modified model as the backbone within DINO for self-supervised learning.
  • Figure 2: Comparison between different architecture designs. Vim's sequential processing allows the model to capture both short-range and long-range dependencies.
  • Figure 3: Sequential processing by Vim is applied to each patch from the slide to produce feature embeddings. This resembles the lawnmower pattern pathologists use to navigate a slide and study cellular neighbourhoods in the tissue for cancer diagnosis. The information from each patch (i.e., the embeddings) is then combined to reach a consensus at the slide level (i.e., aggregation).
  • Figure 4: Representative tumor patch with Vim-s heatmap. The red asterisks highlight intracellular mucin in cancer cells. The yellow asterisks highlight stromal features adjacent to cancer cells. (The heatmaps are generated at 10x and overlaid on 40x images.)
  • Figure 5: Representative tumor patch with ViT-s heatmap. The red asterisks highlight areas centralized on cancer cells. The yellow asterisks highlight other features, notably a focus of intracellular mucin (top-right) and a stromal cell (middle). (The heatmaps are generated at 10x and overlaid on 40x images.)
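
The lawnmower analogy and the aggregation step in the Figure 3 caption can be made concrete with a small sketch. It only illustrates the analogy: `lawnmower_order` and `mean_pool` are hypothetical helper names, the grid size is arbitrary, and mean pooling stands in for whichever MIL aggregator is actually used.

```python
def lawnmower_order(rows, cols):
    """Visit a rows x cols grid row by row, reversing direction on every
    other row, the way a slide is swept back and forth under a microscope."""
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order


def mean_pool(embeddings):
    """Combine per-patch embeddings into one slide-level vector by averaging
    each dimension (a simple stand-in for a learned MIL aggregator)."""
    count = len(embeddings)
    return [sum(dim) / count for dim in zip(*embeddings)]


if __name__ == "__main__":
    # Traversal order for a 2 x 3 grid of positions.
    print(lawnmower_order(2, 3))
    # Two toy 2-dimensional patch embeddings pooled into one slide vector.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Averaging discards the visiting order, so the traversal matters for how each patch is encoded sequentially, while the aggregation step only needs the resulting set of embeddings.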