Correcting Class Imbalances with Self-Training for Improved Universal Lesion Detection and Tagging

Alexander Shieh, Tejas Sudharshan Mathai, Jianfei Liu, Angshuman Paul, Ronald M. Summers

TL;DR

The paper addresses the challenge of incomplete annotations and class imbalance in universal lesion detection and tagging (ULDT) on CT. It introduces a self-training pipeline using VFNet trained on a small annotated DeepLesion subset to mine additional lesions from a larger unlabeled set, followed by iterative retraining across four rounds. A key contribution is balancing the mined data via upsampling, which, when combined with a variable-threshold mining policy, yields substantial gains: a 6.5 percentage-point increase in mean sensitivity at 4FP over a non-upsampled baseline (78.5% vs 72.0%), and an 11.7-point gain over the same policy without upsampling (78.5% vs 66.8%). The approach improves or maintains per-class sensitivity across all eight lesion classes, demonstrating the potential to exploit unannotated data for ULDT in clinical workflows, while highlighting areas for further refinement like class-aware thresholding and alternative semi-supervised methods.
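
The class-balancing step can be illustrated with a minimal sketch. The summary does not state how the upsampling was implemented, so everything here is an assumption for illustration: the `(image_id, label)` tuple layout, the function name `upsample_to_balance`, and the choice to bring every class up to the size of the largest class by resampling with replacement.

```python
import random
from collections import defaultdict


def upsample_to_balance(samples, seed=0):
    """Repeat minority-class samples until every class matches the largest class.

    `samples` is a list of (image_id, label) pairs. Both this layout and the
    "match the largest class" target are assumptions, not the paper's recipe.
    """
    if not samples:
        return []
    rng = random.Random(seed)

    # Group the mined samples by their predicted lesion class.
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)

    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical mined set: "lung" is over-represented, "kidney" is rare.
mined = [("a", "lung")] * 5 + [("b", "bone")] * 2 + [("c", "kidney")]
balanced = upsample_to_balance(mined)
```

After balancing, each class in this toy set contributes the same number of training samples, which is the effect the upsampling is meant to have on under-represented lesion classes.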

Abstract

Universal lesion detection and tagging (ULDT) in CT studies is critical for tumor burden assessment and tracking the progression of lesion status (growth/shrinkage) over time. However, a lack of fully annotated data hinders the development of effective ULDT approaches. Prior work used the DeepLesion dataset (4,427 patients, 10,594 studies, 32,120 CT slices, 32,735 lesions, 8 body part labels) for algorithmic development, but this dataset is not completely annotated and contains class imbalances. To address these issues, in this work, we developed a self-training pipeline for ULDT. A VFNet model was trained on a limited 11.5% subset of DeepLesion (bounding boxes + tags) to detect and classify lesions in CT studies. Then, it identified and incorporated novel lesion candidates from a larger unseen data subset into its training set, and self-trained over multiple rounds. Multiple self-training experiments were conducted with different threshold policies to select higher-quality predicted lesions and to account for the class imbalances. We discovered that direct self-training improved the sensitivities of over-represented lesion classes at the expense of under-represented classes. However, upsampling the lesions mined during self-training along with a variable threshold policy yielded a 6.5 percentage-point increase in sensitivity at 4 FP in contrast to self-training without class balancing (78.5% vs 72%) and an 11.7 percentage-point increase compared to the same self-training policy without upsampling (78.5% vs 66.8%). Furthermore, we show that our results either improved or maintained the sensitivity at 4 FP for all 8 lesion classes.
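
The mine-and-retrain loop described in the abstract can be sketched as follows. This is a schematic under stated assumptions rather than the authors' implementation: `train_fn` and `predict_fn` are hypothetical callables standing in for VFNet training and inference, the per-round `thresholds` list is a placeholder for a threshold policy (the actual policies are not specified in this summary), and re-mining the unlabeled pool from scratch each round is one possible design among several.

```python
def self_train(labeled, unlabeled, train_fn, predict_fn, thresholds):
    """Run one mining-and-retraining round per entry in `thresholds`.

    labeled:    list of (image, box, label) annotations
    unlabeled:  list of images without usable annotations
    train_fn:   hypothetical callable, train_fn(train_set) -> model
    predict_fn: hypothetical callable, predict_fn(model, image)
                -> iterable of (box, label, score)
    thresholds: one confidence cut-off per self-training round (assumed form)
    """
    train_set = list(labeled)
    model = train_fn(train_set)

    for threshold in thresholds:
        mined = []
        for image in unlabeled:
            for box, label, score in predict_fn(model, image):
                # Keep only confident predictions as pseudo-labels.
                if score >= threshold:
                    mined.append((image, box, label))

        # Retrain on the original annotations plus this round's pseudo-labels.
        train_set = list(labeled) + mined
        model = train_fn(train_set)

    return model, train_set


# Toy stand-ins so the sketch runs end to end.
def toy_train(train_set):
    return len(train_set)  # the "model" is just the training-set size


def toy_predict(model, image):
    return [((0, 0, 1, 1), "lung", 0.95), ((0, 0, 2, 2), "bone", 0.50)]


model, train_set = self_train(
    labeled=[("img0", (0, 0, 1, 1), "liver")],
    unlabeled=["img1", "img2"],
    train_fn=toy_train,
    predict_fn=toy_predict,
    thresholds=[0.9, 0.9, 0.9, 0.9],  # four rounds at a fixed 90% cut-off
)
```

In the toy run, only the high-confidence "lung" predictions enter the training set; the low-confidence "bone" predictions are dropped in every round.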

Paper Structure

This paper contains 7 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Self-training pipeline. "DL" stands for DeepLesion dataset. (b) A mined "bone" lesion of the thoracic vertebra. (c) A mined hypodense "liver" lesion. (d) A mined "mediastinum" lesion with 2 low-confidence FP also visible. (b)-(d) represent lesion mining results on the original DeepLesion training split. The green boxes are lesions that were found during self-training. The red boxes are FP with confidence $<$ 90% and were not selected during self-training. The blue boxes are original annotations provided by DeepLesion. They are only plotted for visualization, and were discarded during self-training.
  • Figure 2: The construction of our training, testing and validation splits.
  • Figure 3: The three threshold policies used for our data selection process.
  • Figure 4: The confusion matrix, shown as a heatmap, for the VFNet model with upsampled self-training using the variable threshold policy $E_{V}$. The "abdomen" lesions are sometimes confused with liver and kidney lesions, possibly due to anatomical proximity.
  • Figure 5: CT slices with predicted lesions to demonstrate the self-training data selection process. The detected lesions are plotted as "score|lesion type" for images A-G and "score|lesion type|IOU with ground truth" in images H and I. Blue boxes: ground truth (GT). Green boxes: lesions predicted with confidences $\geq$90%. Yellow boxes: lesions predicted with confidences $<$90% but with IOU $\geq$30% with the GT. Red boxes: false positives. Note that the GT boxes are from the original DeepLesion training set and were not used in our study. They are plotted only for visualization. A-H: lesions were predicted with the correct lesion type and included in self-training. A is a bone lesion located in the thoracic vertebral body. B is a lesion found at the right adrenal gland adjacent to the liver. C is a left mediastinal lesion, D is a liver mass, and E is a right lung nodule. F shows a large lesion in the left kidney, with a possible lesion in the right kidney as well. G is a lesion of the left inguinal lymph node. H shows multiple pelvic lesions. I shows an interesting case in which the kidney lesion selected with high confidence is actually abnormal but not labeled in the ground truth. On the other hand, the ground truth lesion at the pancreatic head shown in I was detected but not selected (with a confidence score of 11). All images are cropped and zoomed in for better visualization.
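
The box colours in the Figure 5 caption follow a simple rule that can be written out explicitly. The sketch below is an illustration only: the function names, the `(x1, y1, x2, y2)` box format, and the order in which the rules are checked are assumptions; the 90% confidence and 30% IOU cut-offs are taken from the caption.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def box_color(score, box, gt_boxes, conf_thr=0.90, iou_thr=0.30):
    """Assign the caption's colour to one predicted box.

    green:  confidence >= 90% (selected for self-training)
    yellow: confidence < 90% but IOU >= 30% with some GT box
    red:    everything else (treated as a false positive)
    """
    if score >= conf_thr:
        return "green"
    if any(iou(box, gt) >= iou_thr for gt in gt_boxes):
        return "yellow"
    return "red"


# Hypothetical boxes, in pixels.
gt = [(0, 0, 10, 10)]
colors = [
    box_color(0.95, (1, 1, 9, 9), gt),           # confident prediction
    box_color(0.11, (5, 0, 15, 10), gt),         # low score, overlaps the GT
    box_color(0.50, (100, 100, 110, 110), gt),   # low score, no overlap
]
```

Under this rule, only the "green" boxes would be mined as pseudo-labels; "yellow" boxes overlap a ground-truth lesion but fall below the confidence cut-off and are therefore not selected.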