SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models

Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova

TL;DR

This work introduces SPIDER, the largest public patch-level histopathology resource spanning multiple organs (Skin, Colorectal, Thorax, Breast) with expert-validated labels and surrounding context patches. A Hibou-L–based baseline architecture uses a frozen feature extractor and an attention-based head to fuse a central patch with 24 context patches in a $5\times5$ grid for accurate patch-level classification and potential WSI segmentation. The dataset employs a semi-automatic annotation pipeline, Faiss-based similarity retrieval, and pathologist verification, and is accompanied by thorough ablations showing the value of context. The authors release both SPIDER and the baselines openly to enable robust cross-organ benchmarking, rapid pathology insights, and groundwork for multimodal AI approaches in digital pathology.
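The similarity-retrieval step mentioned above can be illustrated with a minimal sketch. It swaps in brute-force cosine similarity in plain Python for the Faiss index, and every name here (`normalize`, `top_k_similar`, the toy embeddings) is hypothetical rather than taken from the SPIDER code. The idea it shows: starting from an embedding of a labeled seed patch, rank unlabeled patch embeddings by similarity and pass the closest ones to a pathologist for verification.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_similar(query, candidates, k=3):
    """Return (index, cosine similarity) pairs for the k most similar candidates."""
    q = normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = normalize(cand)
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings" (assumption: real patch embeddings are far larger).
seed_patch = [1.0, 0.0, 0.0]
unlabeled = [
    [0.9, 0.1, 0.0],   # close to the seed
    [0.0, 1.0, 0.0],   # orthogonal to the seed
    [1.0, 0.0, 0.1],   # close to the seed
    [-1.0, 0.0, 0.0],  # opposite direction
]

matches = top_k_similar(seed_patch, unlabeled, k=2)
candidate_indices = [idx for idx, _ in matches]  # patches to send for expert review
```

At dataset scale, an approximate or exact nearest-neighbour index such as Faiss would replace the linear scan; the ranking logic stays the same.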

Abstract

Advancing AI in computational pathology requires large, high-quality, and diverse datasets, yet existing public datasets are often limited in organ diversity, class coverage, or annotation quality. To bridge this gap, we introduce SPIDER (Supervised Pathology Image-DEscription Repository), the largest publicly available patch-level dataset covering multiple organ types, including Skin, Colorectal, Thorax, and Breast, with comprehensive class coverage for each organ. SPIDER provides high-quality annotations verified by expert pathologists and includes surrounding context patches, which enhance classification performance by providing spatial context. Alongside the dataset, we present baseline models trained on SPIDER using the Hibou-L foundation model as a feature extractor combined with an attention-based classification head. The models achieve state-of-the-art performance across multiple tissue categories and serve as strong benchmarks for future digital pathology research. Beyond patch classification, the model enables rapid identification of significant areas and quantitative tissue metrics, and it establishes a foundation for multimodal approaches. Both the dataset and trained models are publicly available to advance research, reproducibility, and AI-driven pathology development. Access them at: https://github.com/HistAI/SPIDER
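To make the "feature extractor plus attention-based head" idea concrete, here is a rough sketch under several assumptions. It treats the inputs as precomputed embeddings for a 5x5 neighbourhood (one central patch plus 24 context patches), and uses single-head dot-product attention with no learned projections and randomly initialised class weights. The functions `fuse_with_context` and `classify` are hypothetical illustrations, not the released SPIDER head, which is trained and likely more elaborate.

```python
import math
import random

GRID = 5                     # 5x5 neighbourhood: 1 central + 24 context patches
CENTER = (GRID * GRID) // 2  # index 12 in row-major order

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_with_context(embeddings):
    """Attend from the central patch over all 25 patch embeddings.

    Returns (fused_vector, attention_weights).
    """
    query = embeddings[CENTER]
    dim = len(query)
    # Scaled dot-product scores between the central patch and every patch.
    scores = [dot(query, e) / math.sqrt(dim) for e in embeddings]
    weights = softmax(scores)
    # Weighted sum of embeddings: the context-aware representation.
    fused = [sum(w * e[j] for w, e in zip(weights, embeddings)) for j in range(dim)]
    return fused, weights

def classify(fused, class_weights):
    """Linear layer plus softmax; returns (predicted class index, probabilities)."""
    logits = [dot(fused, w) for w in class_weights]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Stand-in for frozen-extractor outputs (assumption: random 8-dim vectors).
random.seed(0)
embeddings = [[random.gauss(0, 1) for _ in range(8)] for _ in range(GRID * GRID)]
class_weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]

fused, weights = fuse_with_context(embeddings)
pred, probs = classify(fused, class_weights)
```

Because the feature extractor is frozen, only the small head on top of the embeddings would need training in a setup like this, which keeps experimentation cheap.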

Paper Structure

This paper contains 23 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Dataset preparation pipeline: Raw whole-slide images (WSIs) undergo expert annotation, patch extraction, feature embedding, and similarity-based retrieval. A final verification step ensures high-quality labeled patches for training.
  • Figure 2: Model architecture overview: The classifier processes a central patch alongside surrounding context patches. Features are extracted using the Hibou-L model, and an attention-based classification head integrates context information to improve central patch classification.
  • Figure A1: Dataset skin class distribution
  • Figure A2: Dataset colorectal class distribution
  • Figure A3: Dataset thorax class distribution
  • ...and 2 more figures