Rethinking Lung Cancer Screening: AI Nodule Detection and Diagnosis Outperforms Radiologists, Leading Models, and Standards Beyond Size and Growth

Sylvain Bodard, Pierre Baudot, Benjamin Renoust, Charles Voyton, Gwendoline De Bie, Ezequiel Geremia, Van-Khoa Le, Danny Francis, Pierre-Henri Siot, Yousra Haddou, Vincent Bobin, Jean-Christophe Brisset, Carey C. Thomson, Valerie Bourdes, Benoit Huet

TL;DR

This work redefines lung cancer screening by performing both detection and nodule-level malignancy diagnosis on low-dose CT (LDCT) scans, using a factorized ensemble of shallow models and radiomics to overcome limits in data scale and explainability. The system predicts malignancy directly at the nodule level and integrates context from full-volume CT data, achieving an AUC of 0.98 on internal tests and 0.945 on an independent cohort, with 0.5 false positives per scan at 99.3% sensitivity. Across nodule sizes and in early-stage cancers, the model outperforms radiologists, Lung-RADS, and several leading AI baselines (Sybil, Liao, Ardila, NLST Brock, Mayo), and it surpasses radiologists by up to one year in diagnosing indeterminate and slow-growing nodules. The approach is a modular ensemble that combines 3D/2D CNNs, radiomics, and a full-CT context model through calibrated stacking, and it shows substantial potential to reduce unnecessary follow-ups and enable earlier intervention in lung cancer screening programs.

Abstract

Early detection of malignant lung nodules is critical, but the reliance of screening on nodule size and growth inherently delays diagnosis. We present an AI system that redefines lung cancer screening by performing both detection and malignancy diagnosis directly at the nodule level on low-dose CT scans. To address limitations in dataset scale and explainability, we designed an ensemble of shallow deep learning and feature-based specialized models. Trained and evaluated on 25,709 scans with 69,449 annotated nodules, the system outperforms radiologists, Lung-RADS, and leading AI models (Sybil, Brock, Google, Kaggle). It achieves an area under the receiver operating characteristic curve (AUC) of 0.98 internally and 0.945 on an independent cohort. With 0.5 false positives per scan at 99.3% sensitivity, it addresses key barriers to AI adoption. Critically, it outperforms radiologists across all nodule sizes and stages, excelling in stage 1 cancers, and it outperforms all growth-based metrics, including the least accurate, Volume-Doubling Time. It also surpasses radiologists by up to one year in diagnosing indeterminate and slow-growing nodules.
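
The abstract names Volume-Doubling Time (VDT) among the growth-based metrics without defining it. As a minimal sketch, assuming the standard exponential-growth definition VDT = Δt · ln 2 / ln(V2 / V1), the metric can be computed from two volume measurements as below. The function name and arguments are hypothetical, and the paper's exact computation may differ.

```python
import math


def volume_doubling_time(v1_mm3, v2_mm3, days_between):
    """Volume-doubling time in days, assuming exponential growth.

    v1_mm3, v2_mm3: nodule volumes at the earlier and later time points.
    days_between:   interval between the two scans, in days.
    A negative result means the nodule shrank; no change returns infinity.
    """
    if v1_mm3 <= 0 or v2_mm3 <= 0:
        raise ValueError("volumes must be positive")
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    if v1_mm3 == v2_mm3:
        return math.inf  # no measurable growth
    return days_between * math.log(2) / math.log(v2_mm3 / v1_mm3)


# A nodule that doubles in volume over 90 days has a VDT of about 90 days.
vdt_example = volume_doubling_time(100.0, 200.0, 90.0)
```

Because the metric needs two scans separated in time, any diagnosis that depends on it is delayed by at least that interval, which is the limitation the abstract points to.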

Paper Structure

This paper contains 10 sections, 2 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Performance comparison: our model vs. radiologists, Lung-RADS®, and SOTA models: a. Patient-level ROC curves for our model's malignancy prediction on Test1 and the Independent Cohort. Our model's operating points (OPs) at the Maximum Youden Index (MYI) are depicted with black tilted squares (in all panels of all figures except d., where their colour matches that of the corresponding curve). b. Patient-level ROC curves for our model's malignancy predictions on Test3 and Test4, compared with the mean of the 4-Radiologists' detection and Likelihood Of Malignancy assessments on Test3, and with the Lung-RADS® 2019 score assessments by six radiologists without prior Time Point (TP) evolution on Test4, as provided by Ardila et al. (2019). Our model's OPs equivalent to each Lung-RADS® score are depicted with coloured circles matching the Lung-RADS® scores on the corresponding curves. c. Patient-level ROC curves for malignancy prediction comparing our model and the Lung-RADS® 2019 score assessments by six radiologists with prior TP evolution on Test5, as provided by Ardila et al. (2019). Same OP representation as in b. d. Patient-level ROC curves on Test2 (the Sybil and Ardila et al. test set), comparing our model, Sybil (five ensembled models), the Liao et al. and Ardila et al. models, and the NLST Brock models using the NLST GT for each nodule detected by NLST radiologists. e. Sensitivity and specificity with 95% CI over 5,000 bootstraps for each Lung-RADS® score without prior images, as assessed by six radiologists, and for our model's corresponding accuracy-equivalent OP on Test4 (associated with the OPs in b.). f. Same as e., but for Lung-RADS® with prior images on Test5 (associated with the OPs in c.). g. Sensitivity, specificity and accuracy at the MYI of each ROC, along with the mean AUC with 95% CI over 5,000 bootstraps.
  • Figure 2: Subgroup demographics and model performance analysis: Subgroup definitions are detailed in Supplementary methods. a. The distribution of demographic and scan characteristics in Test1 and IC for cancer, non-cancer and all patients. Sample sizes for the Canon and Philips manufacturers are too small to be represented and are therefore replaced by white space. b. Mean patient AUC of our model for the various subgroups on Test1 and IC. Vertical bars represent the 95% CI over 5,000 bootstraps. c. Table summarizing the values from a. and b. The sample size (n: number of patients) for each subgroup is given with the number of patients with cancer indicated in parentheses, together with the mean AUC and 95% CI over 5,000 bootstrap samples. 'NA' stands for 'Not Available': the presence of COPD at baseline was only available for the EU/AIR subset of IC, where it is 100% ($n=273$ (91), inclusion criterion).
  • Figure 3: Performance comparison: our model vs. individual radiologists: patient-level ROC curves for each of the 12 radiologists who annotated at least 15 patients with cancer and for our model on the same annotated sample of Test1, as detailed in Supplementary methods (the sample size is provided in the title of each ROC). Their mean AUC and CI over 5,000 bootstraps are indicated in the labels, and reported in detail in Supplementary Fig. 2.
  • Figure 4: Comparison of detection performances: our model vs. radiologists, CADe and SOTA model, benign vs. malignant nodules (nnDetection), and performance comparison of our model vs. original nnDetection on the LUNA16 challenge: a. Free-response Receiver Operating Characteristic (FROC) curves on Test1 comparing our CADe/CADx model and the nnDetection CADe/CADx model retrained on Train1 (malignant class detection only). Also shown: FROCs of our CADe-only module for the nodule detection task (malignant and benign class detection), of the nnDetection model retrained on NLST (both malignant and benign class detection), and of the nnDetection model trained by Baumgartner et al. (2021) on LUNA16 (both malignant and benign class detection). b. FROCs of our CADe/CADx model on Test3 and IC, compared to the mean performance of the 4-Radiologists' assessments. c. FROC curves of the nnDetection CADe/CADx model retrained on NLST (Train3) for the benign and malignant nodule detection tasks as separate classes on Test1 (the malignant nodule detection curve is the same as in a.). d. FROC curve of our CADe model trained on NLST alongside the nnDetection model trained by Baumgartner et al. (2021) on LUNA16, on the LUNA16 nodule detection challenge dataset. The Competition Performance Metric (CPM) of 0.929 achieved in this study for the Baumgartner et al. model reproduces their published CPM of 0.930, thereby "outperforming all previous methods on the nodule-candidate-detection task" (Baumgartner et al., 2021). e. Table of the mean sensitivity (in percent) at the FROC OPs closest to 1 and 0.5 mean FP/scan for the various predictions in a., b., c. and d., with 95% CI over 5,000 bootstrap samples. The CPMs for the LUNA16 challenge task are also included.
  • Figure 5: Performance across size-based subgroups: a. Patient-level ROC curves on Test3 comparing our model's malignancy prediction and the mean of the 4-Radiologists' assessments for patients whose largest nodule has a GT diameter within the $[4,10]$ mm range, and for stage IA (vs. non-cancer) patients. b. Table summarizing panels a., c., d. and e.: patient- and nodule-level AUCs for our model and for the 4-Radiologists across nodule size and cancer stage subpopulations, with the corresponding malignant and benign sample sizes. 95% CIs are provided over 5,000 bootstraps. The discrepancy between the 19 malignant nodules (nodule level) and the 11 cancers (patient level) reflects patients with multiple malignant nodules and differences in diameter between GT and our model's estimates. c. Mean AUC for subgroups of patients whose largest nodule has a GT diameter within the $[4,10[$ mm, $[10,20[$ mm and $[20,30]$ mm ranges on Test3. Horizontal bars represent the 95% CI over 5,000 bootstraps (as for other panels). d. Nodule-level mean AUC of the mean of the four Radiologists' findings compared to our model's findings, with computed diameter in the $[4,10[$ mm, $[10,20[$ mm and $[20,30]$ mm ranges on Test3. e. Mean AUC for patients with stage 1A, stage 1 (stage 1A and stage 1B), and late stages (stage >1), compared to non-cancer patients. This stratification is imposed by the low prevalence of stage 1B and late stages in NLST (see Supplementary Information). f. Our model's outputs for the 19 malignant nodules in the $[4,10[$ mm range in GT. Pink contours represent the automatic segmentation of the nodule. Each CT patch is 40 × 40 mm. Nodules are ordered by ascending malignancy prediction in reading order. Red dots mark the four malignant nodules misclassified as benign (false negatives at the Maximum Youden Index). Orange dots denote the 11 nodules identified as the patient's largest nodule (as in b.); in all other cases, the patient presents a larger malignant nodule.
  • ...and 14 more figures
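
The captions above report operating points at the Maximum Youden Index (MYI) and mean AUCs with 95% CI over 5,000 bootstraps. The source does not specify the bootstrap procedure, so the following is only a sketch, assuming a percentile bootstrap that resamples patients with replacement. All function names are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation.

```python
import random


def auc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def max_youden_operating_point(labels, scores):
    """Return (threshold, sensitivity, specificity) maximising J = sens + spec - 1.

    A case is called positive when its score is >= threshold. On ties in J,
    the lowest threshold is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec - 1 > best[1] + best[2] - 1:
            best = (t, sens, spec)
    return best


def bootstrap_auc_ci(labels, scores, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap: return (mean AUC, lower bound, upper bound)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(stats) / n_boot, lower, upper


# Toy patient-level data (illustrative only, not from the paper).
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
toy_auc = auc(toy_labels, toy_scores)
toy_op = max_youden_operating_point(toy_labels, toy_scores)
toy_mean, toy_lower, toy_upper = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200)
```

Resampling at the patient level rather than the nodule level keeps nodules from the same patient together, which is one reasonable choice when patient-level AUCs are reported.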