Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy

Stanislav Nikolov, Sam Blackwell, Alexei Zverovitch, Ruheena Mendes, Michelle Livne, Jeffrey De Fauw, Yojan Patel, Clemens Meyer, Harry Askham, Bernardino Romera-Paredes, Christopher Kelly, Alan Karthikesalingam, Carlton Chu, Dawn Carnell, Cheng Boon, Derek D'Souza, Syed Ali Moinuddin, Bethany Garie, Yasmin McQuinlan, Sarah Ireland, Kiarna Hampton, Krystle Fuller, Hugh Montgomery, Geraint Rees, Mustafa Suleyman, Trevor Back, Cían Hughes, Joseph R. Ledsam, Olaf Ronneberger

TL;DR

The paper tackles the clinical bottleneck of manual head-and-neck OAR delineation for radiotherapy by introducing a 3D residual U-Net that segments 21 OARs on planning CT. It proposes a clinically oriented surface Dice similarity coefficient to better reflect the effort required to correct automated segmentations and demonstrates expert-level performance on a held-out UCLH test set, with strong generalization to external datasets (TCIA and PDDCA). While most OARs meet human variability benchmarks, two organs with poorer image quality (brainstem, right lens) reveal remaining limitations. The work provides open data resources and tooling to enable objective benchmarking, suggesting substantial potential to improve safety, efficiency, and consistency in radiotherapy planning, pending regulatory validation and prospective clinical study.

Abstract

Over half a million individuals are diagnosed with head and neck cancer each year worldwide. Radiotherapy is an important curative treatment for this disease, but it requires manual, time-consuming delineation of radio-sensitive organs at risk (OARs). This planning process can delay treatment, while also introducing inter-operator variability with resulting downstream radiation dose differences. While auto-segmentation algorithms offer a potentially time-saving solution, the challenges in defining, quantifying and achieving expert performance remain. Adopting a deep learning approach, we demonstrate a 3D U-Net architecture that achieves expert-level performance in delineating 21 distinct head and neck OARs commonly segmented in clinical practice. The model was trained on a dataset of 663 deidentified computed tomography (CT) scans acquired in routine clinical practice, with both segmentations taken from clinical practice and segmentations created by experienced radiographers as part of this research, all in accordance with consensus OAR definitions. We demonstrate the model's clinical applicability by assessing its performance on a test set of 21 CT scans from clinical practice, each with the 21 OARs segmented by two independent experts. We also introduce surface Dice similarity coefficient (surface DSC), a new metric for the comparison of organ delineation, to quantify deviation between OAR surface contours rather than volumes, better reflecting the clinical task of correcting errors in the automated organ segmentations. The model's generalisability is then demonstrated on two distinct open-source datasets, drawn from centres and countries different from those used for model training. With appropriate validation studies and regulatory approvals, this system could improve the efficiency, consistency, and safety of radiotherapy pathways.

Paper Structure

This paper contains 21 sections, 7 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: A typical clinical pathway for radiotherapy. After a patient is diagnosed and the decision is made to treat with radiotherapy, a defined workflow aims to provide treatment that is both safe and effective. In the UK the time delay between decision to treat and treatment delivery should be no greater than 31 days \cite{nhscancerplan}. Time-intensive manual segmentation and dose optimisation steps can introduce delays to treatment.
  • Figure 2: Example results. (CT image) Axial slices at five representative levels from the raw CT scan of a 55-59-year-old male patient from the UCLH dataset (patient UCLH-20). The levels shown as 2D slices were selected to demonstrate all 21 OARs included in this study. The window levelling has been adjusted for each slice to best display the anatomy present. (Oncologist contour) The ground truth segmentation, as defined by experienced radiographers and arbitrated by a head and neck specialist oncologist. (Model contour) Segmentations produced by our model. (Contour comparison) Contoured by Oncologist only (green region) or Model only (yellow region). Two further randomly selected UCLH set scans are shown in \ref{fig:examples_uclh_2} and \ref{fig:examples_uclh_3}. Best viewed on a display.
  • Figure 3: Surface DSC performance metric. (a) Illustration of the computation of the surface DSC. Continuous line: predicted surface. Dashed line: ground truth surface. Black arrow: the maximum margin of deviation which may be tolerated without penalty, hereafter referred to as $\tau$. Note that in our use case each OAR has an independently calculated value for $\tau$. Green: acceptable surface parts (distance between surfaces $\leq\tau$). Pink: unacceptable regions of the surfaces (distance between surfaces $>\tau$). The proposed surface DSC metric reports the good surface parts compared to the total surface (sum of predicted surface area and ground truth surface area). (b) Illustration of the determination of the organ-specific tolerance. Green: segmentation of an organ by oncologist A. Black: segmentation by oncologist B. Red: distances between the surfaces. We defined the organ-specific tolerance as the 95th percentile of the distances collected across multiple segmentations from a subset of seven TCIA scans, where each segmentation was performed by a radiographer and then arbitrated by an oncologist, neither of whom had seen the scan previously.
  • Figure 4: UCLH test set: Quantitative performance of the model in comparison to radiographers. (a) The model achieves a surface DSC similar to humans in all 21 organs at risk (on the UCLH held out test set) when compared to the gold standard for each organ at an organ-specific tolerance $\tau$. Blue: our model, green: radiographers. (b) Performance difference between the model and the radiographers. Each blue dot represents a model-radiographer pair. The grey area highlights non-substantial differences (-5% to +5%). The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers indicate most extreme, non-outlier data points. Where data lies outside 1.5 $\times$ interquartile range it is represented as a circular flier. The notches represent the 95% confidence interval (CI) around the median.
  • Figure 5: Model generalisability to an independent test set from TCIA. Quantitative performance of the model on the TCIA test set in comparison to radiographers. (a) Surface DSC (on the TCIA open source test set) for the segmentations compared to the gold standard for each organ at an organ-specific tolerance $\tau$. Blue: our model, green: radiographers. (b) Performance difference between the model and the radiographers. Each blue dot represents a model-radiographer pair. Red lines show the mean difference. The grey area highlights non-substantial differences (-5% to +5%). The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data, except where data lies outside 1.5 $\times$ interquartile range, which is represented as a circular flier. The notches represent the 95% confidence interval (CI) around the median.
  • ...and 12 more figures
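
The Figure 3 caption describes the surface DSC and the organ-specific tolerance $\tau$ in words. The following is a minimal Python sketch of both ideas, not the authors' implementation. It assumes binary voxel masks, approximates surface area by counting boundary voxels, and uses hypothetical function names (`surface_voxels`, `surface_distances`, `surface_dsc`, `organ_tolerance`) chosen only for illustration. A surface-element-weighted version would be closer to the caption's wording.

```python
import numpy as np
from scipy import ndimage


def surface_voxels(mask):
    """Boundary voxels: foreground voxels that one erosion step removes."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~ndimage.binary_erosion(mask)


def surface_distances(mask_a, mask_b, spacing_mm=(1.0, 1.0, 1.0)):
    """Distance (mm) from each surface voxel of A to the nearest surface voxel of B."""
    surf_a = surface_voxels(mask_a)
    surf_b = surface_voxels(mask_b)
    if not surf_b.any():
        # No surface to compare against: treat every distance as infinite.
        return np.full(np.count_nonzero(surf_a), np.inf)
    # The EDT returns the distance to the nearest zero voxel, so the
    # surface of B is inverted to make its voxels the zeros.
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing_mm)
    return dist_to_b[surf_a]


def surface_dsc(pred, truth, tau_mm, spacing_mm=(1.0, 1.0, 1.0)):
    """Fraction of both surfaces lying within tau_mm of the other surface.

    Assumption: voxel counts stand in for surface area.
    """
    d_pred = surface_distances(pred, truth, spacing_mm)
    d_truth = surface_distances(truth, pred, spacing_mm)
    total = d_pred.size + d_truth.size
    if total == 0:
        return float("nan")  # both masks empty: metric undefined
    acceptable = np.count_nonzero(d_pred <= tau_mm) + np.count_nonzero(d_truth <= tau_mm)
    return acceptable / total


def organ_tolerance(mask_pairs, spacing_mm=(1.0, 1.0, 1.0), percentile=95.0):
    """Percentile of surface distances pooled over pairs of expert masks."""
    pooled = []
    for mask_a, mask_b in mask_pairs:
        pooled.append(surface_distances(mask_a, mask_b, spacing_mm))
        pooled.append(surface_distances(mask_b, mask_a, spacing_mm))
    return float(np.percentile(np.concatenate(pooled), percentile))


if __name__ == "__main__":
    # Toy example: two 10x10x10 cubes offset by one voxel along the first axis.
    a = np.zeros((20, 20, 20), dtype=bool)
    b = np.zeros((20, 20, 20), dtype=bool)
    a[5:15, 5:15, 5:15] = True
    b[6:16, 5:15, 5:15] = True
    print("surface DSC at tau = 0 mm:", surface_dsc(a, b, tau_mm=0.0))
    print("surface DSC at tau = 1 mm:", surface_dsc(a, b, tau_mm=1.0))
    print("95th percentile tolerance:", organ_tolerance([(a, b)]))
```

In this toy case a one-voxel shift is penalised at a tolerance of 0 mm but not at 1 mm, which mirrors the caption's point that deviations within $\tau$ are tolerated without penalty.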