DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification

Ravi Mosalpuri; Mohammed Abdelsamea; Ahmed Karam Eldaly

Abstract

Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.

Paper Structure (16 sections, 1 equation, 5 figures, 6 tables)

Introduction
Related Work
Methodology
Datasets
Data Preparation and Preprocessing
Model Architecture
Model Training
Experimental Results
Hyperparameter Optimisation
Evaluation Protocol
Evaluation Metrics
Quantitative Results
Attention Visualisation
Comparative Analysis with State-of-the-Art Models
Discussion
...and 1 more sections

Figures (5)

Figure 1: Example histopathological images from the datasets used in this study. Top row: Colon tissue samples from the LC25000 dataset showing (h) benign colon tissue and (i) colon adenocarcinoma. Middle row: Lung tissue samples showing (a) benign lung tissue, (b) squamous cell carcinoma, and (c) adenocarcinoma. Bottom row: Peripheral blood smear images from the acute lymphoblastic leukaemia (ALL) dataset showing (d) benign cells, (e) early Pre-B lymphoblasts, (f) Pre-B lymphoblasts, and (g) Pro-B lymphoblasts. All histopathological images are stained using hematoxylin and eosin (H&E).
Figure 2: Overview of the proposed Vision Transformer (ViT-16) based pipeline for histopathology image classification. The framework consists of three stages: (i) Preprocessing, where raw histopathology images are resized to 256 pixels, augmented through random resizing, horizontal flipping, rotation, and colour jitter, and then normalised with ImageNet statistics; (ii) Feature Extraction, which uses a ViT-16 base model pre-trained on ImageNet to obtain the CLS token representation; and (iii) Custom Classification Head, comprising sequential dense layers (512 and 256 units) with batch normalisation, ReLU activation, and configurable dropout, culminating in a sigmoid-activated output layer. This architecture integrates transformer-based global feature learning with a tailored classification head to enhance performance on medical image analysis tasks.
Figure 3: Confusion matrices for the three evaluated datasets: lung cancer (LC25000), colon cancer (LC25000), and acute lymphoblastic leukaemia (ALL). The lung and colon datasets show perfect classification performance, while the ALL dataset shows near-perfect classification with minimal misclassification.
Figure 4: Training and validation accuracy and loss curves across datasets, demonstrating stable convergence and consistent generalisation performance.
Figure 5: Attention visualisation generated by the proposed DeepHistoViT model. For each example, the original histopathology image is shown alongside the corresponding attention map, highlighting regions that contribute most strongly to the classification decision. The attention maps demonstrate spatial localisation of diagnostically relevant morphological features.
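
The metrics named in the abstract (accuracy, precision, recall, F1-score) can all be derived from confusion matrices such as those in Figure 3. The following is a minimal sketch of that relationship, not the authors' evaluation code. The confusion matrix `cm` contains made-up counts rather than values from the paper, macro averaging is an assumption, and the Wilson score interval is only one common way to attach a 95 percent confidence interval to a proportion; the paper's actual interval method is not stated in this summary.

```python
import math


def per_class_metrics(cm):
    """Return (precision, recall, f1) per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true class k, predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append((precision, recall, f1))
    return rows


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 over classes (assumed averaging)."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(3))


def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical 4-class confusion matrix (illustrative counts only).
cm = [
    [100, 0, 0, 0],
    [0, 99, 1, 0],
    [0, 0, 100, 0],
    [0, 0, 0, 100],
]

if __name__ == "__main__":
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    print("accuracy:", accuracy(cm))
    print("macro precision/recall/F1:", macro_average(per_class_metrics(cm)))
    print("95% Wilson interval for accuracy:", wilson_interval(correct, total))
```

One property of this particular interval is worth noting: when every sample is classified correctly, the upper bound is 1 but the lower bound is 1 / (1 + z²/n), which stays below 1 for any finite test set size n. The width of a reported interval therefore depends directly on how many test images were evaluated.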
