Label Critic: Design Data Before Models

Pedro R. A. S. Bassi, Qilong Wu, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Alan Yuille, Zongwei Zhou

TL;DR

This work developed an automatic tool, called Label Critic, that can assess label quality through tireless pairwise comparisons, and finds that, rather than creating annotations from scratch, radiologists only have to review and edit errors if the Best-AI Labels have mistakes.

Abstract

As medical datasets rapidly expand, creating detailed annotations of different body structures becomes increasingly expensive and time-consuming. We consider that requesting radiologists to create detailed annotations is unnecessarily burdensome and that pre-existing AI models can largely automate this process. Following the spirit of "don't use a sledgehammer on a nut", we find that, rather than creating annotations from scratch, radiologists only have to review and edit errors if the Best-AI Labels have mistakes. To obtain the Best-AI Labels among multiple AI Labels, we developed an automatic tool, called Label Critic, that can assess label quality through tireless pairwise comparisons. Extensive experiments demonstrate that, when combined with our Image-Prompt pairs, pre-existing Large Vision-Language Models (LVLMs), trained on natural images and texts, achieve 96.5% accuracy when choosing the best label in a pairwise comparison, without extra fine-tuning. By transforming the manual annotation task (30-60 min/scan) into an automatic comparison task (15 sec/scan), we effectively reduce the manual effort required from radiologists by an order of magnitude. When the Best-AI Labels are sufficiently accurate (81% depending on body structures), they will be directly adopted as the gold-standard annotations for the dataset, with lower-quality AI Labels automatically discarded. Label Critic can also check the label quality of a single AI Label with 71.8% accuracy when no alternatives are available for comparison, prompting radiologists to review and edit if the estimated quality is low (19% depending on body structures).

Paper Structure

This paper contains 6 sections, 1 figure, 1 table, 1 algorithm.

Figures (1)

  • Figure 1: (a) Public CT datasets with per-voxel labels are rapidly expanding, largely due to AI-assisted labeling. However, AI often makes obvious errors, as exemplified in the liver, IVC, and kidneys, highlighting the need for efficient, automated error detection. (b) Label Critic pipeline for comparing labels. (I) Frontally project the CT scan (see the section on projections) and overlay it with the projections of two candidate labels (red), $y_{1}$ and $y_{2}$, creating two images; (II) compute the Dice score (DSC) between the two label projections and skip the comparison if the DSC is above a class-specific threshold, which avoids comparing overly similar labels; (III) ask an LVLM (see the section on LVLMs) to compare the labels and choose the more correct one. If $y_{1}$ is a dataset label we are evaluating, we consider it wrong if the LVLM prefers $y_{2}$, the output of an alternative public segmentation model. (c) 3-Step Prompt Design. Prompt 1 asks if the target organ should be in the CT, providing a skeleton projection as reference. If the LVLM says no, we select an empty label (if available) or flag the case for review. Otherwise, Prompt 2 asks the LVLM to compare two label overlays using class-aware prompts with anatomical guidance, optional in-context learning, and complexity based on the LVLM's background knowledge of each class (see the section on prompt design). Prompt 3 asks the LVLM to summarize its previous answer. Summarization provides an easily processable binary answer while still allowing detailed justifications and step-by-step reasoning in the earlier steps.
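Steps (II) and (III) of panel (b) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of axis 1 as the front-to-back direction, the "any voxel along the ray" label projection, and the `ask_lvlm` callable are all assumptions made for illustration. The sketch only shows how a Dice check on the 2D projections can gate the more expensive LVLM comparison.

```python
import numpy as np

def project_label(label, axis=1):
    """Collapse a binary 3D label into a 2D mask.

    Assumption: a pixel is on if any voxel along `axis` is on, and
    `axis` is the front-to-back direction of the volume.
    """
    return np.asarray(label, dtype=bool).any(axis=axis)

def dice(a, b):
    """Dice score (DSC) between two binary masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # both masks empty: treat them as identical
    return 2.0 * np.logical_and(a, b).sum() / total

def compare_labels(y1, y2, threshold, ask_lvlm, axis=1):
    """Return None if the projections are too similar, else the LVLM's choice.

    `threshold` stands in for the class-specific DSC threshold and
    `ask_lvlm` is a hypothetical callable that receives the two
    projected masks and returns 1 or 2.
    """
    p1 = project_label(y1, axis)
    p2 = project_label(y2, axis)
    if dice(p1, p2) > threshold:
        return None  # skip: labels are nearly identical in projection
    return ask_lvlm(p1, p2)

# Toy volumes: y1 and y2 project to the same mask, y3 is shifted.
y1 = np.zeros((4, 4, 4), dtype=np.uint8)
y1[1:3, :, 1:3] = 1
y2 = np.zeros((4, 4, 4), dtype=np.uint8)
y2[1:3, 0, 1:3] = 1
y3 = np.zeros((4, 4, 4), dtype=np.uint8)
y3[0:2, :, 0:2] = 1

def prefer_second(p1, p2):
    """Stand-in for a real LVLM call; always prefers the second label."""
    return 2

skipped = compare_labels(y1, y2, 0.9, prefer_second)
chosen = compare_labels(y1, y3, 0.9, prefer_second)
```

In the toy data, `y1` and `y2` differ in 3D but project to identical masks, so the comparison is skipped, while `y1` and `y3` overlap in only one projected pixel and are passed to the stand-in comparator. A real pipeline would overlay the projections on the projected CT image and send those images to the LVLM with the prompts from panel (c).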