CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu

TL;DR

Multiple state-of-the-art multimodal models are evaluated on CT-Bench, a CT benchmark with lesion-level annotations, and compared with radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis.

Abstract

Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.

Paper Structure (14 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Overview of CT-Bench. (a) Example of data annotation, showing the transformation of original reports from PACS into detailed descriptions enriched with size information for CT-Bench: Lesion Image & Metadata Set. (b) Analysis of description: word count and value distribution. (c) Distribution of lesion locations. (d) CT-Bench: components of the QA benchmark cases.
  • Figure 2: Heatmap of performance change with BBox: model performance differences across tasks when BBox annotations are used. Positive values indicate performance gains due to BBox, while negative values indicate performance drops.
  • Figure 3: Comparison of model performance on single-slice (Img2txt / Img2attrib) versus multi-slice context tasks (Context2txt / Context2attrib).
  • Figure 4: Illustration of the selection process for "hard negative" cases using the BiomedCLIP model and MD validation.
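
The quantity plotted in Figure 2 is a per-task difference between scores obtained with and without BBox annotations. A minimal sketch of that computation follows. The function name `bbox_delta` is hypothetical, the task names are borrowed from Figure 3, and the scores are made-up placeholders rather than results from the paper.

```python
from typing import Dict


def bbox_delta(with_bbox: Dict[str, float],
               without_bbox: Dict[str, float]) -> Dict[str, float]:
    """Per-task score change when BBox annotations are supplied.

    Positive values mean the BBox helped; negative values mean it hurt.
    Only tasks present in both inputs are compared.
    """
    shared_tasks = with_bbox.keys() & without_bbox.keys()
    return {
        task: round(with_bbox[task] - without_bbox[task], 4)
        for task in sorted(shared_tasks)
    }


# Placeholder scores for illustration only (not taken from the paper).
scores_with = {"Img2txt": 0.62, "Img2attrib": 0.55}
scores_without = {"Img2txt": 0.50, "Img2attrib": 0.58}

deltas = bbox_delta(scores_with, scores_without)
for task, delta in deltas.items():
    direction = "gain" if delta > 0 else "drop" if delta < 0 else "no change"
    print(f"{task}: {delta:+.2f} ({direction})")
```

Computing one such dictionary per model and stacking the results gives the model-by-task grid that a heatmap like Figure 2 would display.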