HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis

Xin Zhang, Liangxiu Han, Yue Shi, Yanlin Zheng, Uazman Alam, Maryam Ferdousi, Rayaz Malik

TL;DR

HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications, and detailed ablation studies reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training.

Abstract

Diabetic Peripheral Neuropathy (DPN) affects nearly half of patients with diabetes, making early detection essential. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and limited data. We propose HMSViT, a Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT combines pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction: early layers capture fine-grained local details and deeper layers integrate global context, all at a lower computational cost. A block-masked self-supervised learning (SSL) framework designed for HMSViT reduces reliance on labelled data and enhances feature robustness, while a multi-scale decoder fuses hierarchical features for segmentation and classification. Experiments on clinical CCM datasets show that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models such as the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these experiments indicate that HMSViT delivers robust and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.
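
For readers unfamiliar with the segmentation metric quoted above, mIoU is the mean of per-class intersection-over-union scores. The following is a minimal NumPy sketch of that standard definition. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration; the abstract does not state the exact averaging protocol used in the paper.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes (illustrative, not the paper's code).

    pred, target: integer label arrays of identical shape.
    Classes that appear in neither array are skipped (an assumption).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; no defined IoU
        ious.append(np.logical_and(p, t).sum() / union)
    # Return NaN if no class was present at all.
    return float(np.mean(ious)) if ious else float("nan")

# Tiny two-class example (0 = background, 1 = nerve), hypothetical labels.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy example the background IoU is 1/2 and the nerve IoU is 2/3, so the score is their average.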

Paper Structure

This paper contains 27 sections, 10 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of the proposed method. 1) HMSViT is designed to learn multi-scale representations across four stages through pooling-based dual attention mechanisms; 2) Block-masked self-supervised learning is used to pretrain HMSViT, enhancing the model's spatial reasoning by reconstructing masked input blocks; 3) A multi-scale decoder is designed for downstream tasks, including corneal nerve segmentation and diabetic neuropathy diagnosis.
  • Figure 2: The architecture of the Hierarchical Multi-Scale Vision Transformer (HMSViT). An input image is processed through four hierarchical stages. Early stages use block-based local attention on high-resolution feature maps, while later stages switch to global attention after pooling reduces the spatial dimensions. This dual-attention approach balances computational efficiency with the ability to capture both local details and global context.
  • Figure 3: An overview of the block-masked self-supervised learning framework. Patches are grouped into blocks, a portion of which are randomly masked. The HMSViT encoder processes the visible blocks and is trained to reconstruct the original input, forcing it to learn robust spatial representations.
  • Figure 4: Architecture of the multi-scale decoder. For the segmentation task, hierarchical features from all stages are upsampled and fused to generate a high-resolution segmentation map. For the classification task, the global feature from the final stage is processed by a Multi-Layer Perceptron (MLP) to produce the diagnostic prediction.
  • Figure 5: Visualisation of corneal nerve fibre analysis from corneal confocal microscopy (CCM) images. Top row: original corneal nerve images from three subjects. Middle row: ground-truth manual annotations (cyan traces). Bottom row: automated nerve fibre segmentation showing main nerves (red) and branches (blue), with detected branching points (green dots).
  • ...and 2 more figures
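
The block masking described in the Figure 3 caption (patches grouped into blocks, a random portion of blocks masked) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `block_mask`, the block size, and the mask ratio are hypothetical, since the caption does not specify them.

```python
import numpy as np

def block_mask(grid_h, grid_w, block, mask_ratio, rng):
    """Return a boolean (grid_h, grid_w) patch mask; True means masked.

    Patches are grouped into non-overlapping block x block squares and
    whole squares are masked at random (all values here are assumptions).
    """
    if grid_h % block or grid_w % block:
        raise ValueError("patch grid must be divisible by the block size")
    bh, bw = grid_h // block, grid_w // block
    n_blocks = bh * bw
    n_masked = int(round(mask_ratio * n_blocks))
    # Choose which blocks to hide, without replacement.
    flat = np.zeros(n_blocks, dtype=bool)
    flat[rng.choice(n_blocks, size=n_masked, replace=False)] = True
    block_grid = flat.reshape(bh, bw)
    # Expand each block-level decision to every patch inside that block.
    return block_grid.repeat(block, axis=0).repeat(block, axis=1)

rng = np.random.default_rng(0)
mask = block_mask(16, 16, block=4, mask_ratio=0.5, rng=rng)
# Indices of visible patches, i.e. the tokens an encoder would receive.
visible_idx = np.flatnonzero(~mask.ravel())
```

Masking whole blocks rather than isolated patches removes contiguous regions, so the reconstruction target cannot be recovered by copying from an immediately adjacent visible patch; this matches the caption's stated aim of forcing the encoder to learn spatial representations.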