An Attention Based Pipeline for Identifying Pre-Cancer Lesions in Head and Neck Clinical Images

Abdullah Alsalemi, Anza Shakeel, Mollie Clark, Syed Ali Khurram, Shan E Ahmed Raza

TL;DR

The paper addresses early detection of head and neck cancer by identifying potentially malignant oral lesions from clinical images. It introduces an attention-based pipeline combining a vision-transformer enhanced Mask R-CNN for lesion detection/segmentation with a MIL-based, VGG-16–driven classifier for three-class grading (non-dysplastic, dysplastic, cancerous). The approach achieves a Dice score of 57.1% and an 82.4% overlap accuracy on unseen data for segmentation, and an overall F1-score of 85.0% on internal data for classification, with an accompanying app demonstration. This multi-centre dataset–driven method offers a non-invasive triage tool for earlier diagnosis and lays groundwork for future integration with endoscopic video data.

Abstract

Early detection of cancer can improve patient prognosis by enabling early intervention. Head and neck cancer is diagnosed in specialist centres after a surgical biopsy; however, cases can be missed, leading to delayed diagnosis. To overcome these challenges, we present an attention-based pipeline that identifies suspected lesions, segments them, and classifies them as non-dysplastic, dysplastic, or cancerous. We propose (a) a vision-transformer-based Mask R-CNN network for lesion detection and segmentation in clinical images, and (b) a Multiple Instance Learning (MIL) based scheme for classification. Current results show that the segmentation model produces segmentation masks and bounding boxes with up to 82% overlap accuracy on unseen external test data, surpassing the reviewed segmentation benchmarks. The classification model achieves an F1-score of 85% on the internal cohort test set. An app has been developed to perform lesion segmentation on images taken with a smart device. Future work involves employing endoscopic video data for precise early detection and prognosis.
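For readers less familiar with segmentation metrics, the sketch below shows how the Dice score quoted in the TL;DR, and the related intersection-over-union (IoU), are commonly computed for a pair of binary masks. It is a minimal illustration, not the authors' evaluation code: the function names, the nested-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions, and the exact definition of "overlap accuracy" used in the paper is not given in this summary.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # hypothetical format: rows of 0/1 pixels


def _overlap_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted area, target area) in pixels."""
    inter = pred_area = target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += int(p and t)
            pred_area += int(p)
            target_area += int(t)
    return inter, pred_area, target_area


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|)."""
    inter, pred_area, target_area = _overlap_counts(pred, target)
    total = pred_area + target_area
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total


def iou_score(pred: Mask, target: Mask) -> float:
    """IoU = |A intersect B| / |A union B|."""
    inter, pred_area, target_area = _overlap_counts(pred, target)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0  # same assumed convention as above
    return inter / union


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    annotated = [[1, 0], [0, 0]]
    print(dice_score(predicted, annotated), iou_score(predicted, annotated))
```

In the small example, one of the two predicted pixels overlaps the single annotated pixel, so Dice is 2/3 and IoU is 1/2. Dice is never lower than IoU for the same pair of masks, which is worth keeping in mind when comparing figures reported under different metrics.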

Paper Structure (10 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Block diagram of the proposed (A) vision transformer based segmentation and (B) VGG-16 based Multiple Instance Learning (MIL) classification pipeline.
  • Figure 2: Sample of clinical oral photographs from the Sheffield cohort representing non-dysplastic (green), dysplastic (orange) and cancerous (red) lesions. Expert annotations of affected lesions are shown in red on each image.
  • Figure 3: Top: F1-score for VGG-16, VGG-16 MIL, DenseNet and DenseNet MIL, where MIL is the network architecture shown in Fig. 1B. Bottom: Sample images classified by the VGG-16 MIL model.
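
The MIL classification stage in Figure 1B treats an image as a "bag" of instances (for example, patches) and derives one image-level label from the per-instance outputs. The sketch below illustrates that general idea with simple mean pooling of per-instance class probabilities. It is an assumption-laden illustration only: the aggregation used in the paper is not described in this summary, and the names `softmax`, `mil_mean_pool`, and `predict_bag`, as well as the example logits, are hypothetical.

```python
import math
from typing import List, Sequence

# Class order is an assumption made for this illustration.
CLASSES = ("non-dysplastic", "dysplastic", "cancerous")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one instance's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mil_mean_pool(instance_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average per-instance class probabilities into one bag-level distribution."""
    if not instance_logits:
        raise ValueError("a bag needs at least one instance")
    probs = [softmax(logits) for logits in instance_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(CLASSES))]


def predict_bag(instance_logits: Sequence[Sequence[float]]) -> str:
    """Return the class with the highest pooled probability."""
    pooled = mil_mean_pool(instance_logits)
    best = max(range(len(pooled)), key=pooled.__getitem__)
    return CLASSES[best]


if __name__ == "__main__":
    # Hypothetical logits for three patches of one image,
    # e.g. as produced by a CNN backbone such as VGG-16.
    bag = [[0.0, 0.0, 5.0], [0.0, 1.0, 4.0], [2.0, 0.0, 1.0]]
    print(predict_bag(bag))
```

Mean pooling is only one of several common MIL aggregators; max pooling or learned attention weights are frequent alternatives, and the choice affects how strongly a single suspicious patch can drive the image-level label.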