LEAVS: An LLM-based Labeler for Abdominal CT Supervision
Ricardo Bigolin Lanfredi, Yan Zhuang, Mark Finkelstein, Praveen Thoppey Srinivasan Balamuralikrishna, Luke Krembs, Brandon Khoury, Arthi Reddy, Pritam Mukherjee, Neil M. Rofsky, Ronald M. Summers
TL;DR
This work introduces LEAVS, a large-language-model-based labeler designed to extract structured, multi-faceted labels from abdominal CT reports, addressing the gap in abdominal radiology labeling. The pipeline uses sentence filtration, a tree-structured, multiple-choice finding-type prompt, and urgency assessment to annotate nine organs across seven abnormality types, with zero-shot prompting on a locally run LLM. LEAVS achieves a high abnormality-labeling F1 (~0.89) and urgency labeling on par with human annotations, outperforming baselines like MAPLEZ and SARLE, and provides labeled data to train a downstream vision classifier that attains meaningful AUCs (average around 0.716) across several finding types. The approach demonstrates domain transfer to abdominal CTs, offers a release of code and AMOS-MM annotations, and points toward scalable, universal abdominal abnormality detection with potential real-world impact in radiology workflows.
Abstract
Extracting structured labels from radiology reports has been used to train vision models that simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports, owing to the more complex anatomy and the wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally run LLM, using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that the extraction of urgency labels achieves performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.
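The control flow described in the abstract (sentence extraction, followed by multiple-choice questions arranged as a decision tree) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the released LEAVS code: the `Query` structure, the answer choices, the example organ and finding names, and `mock_llm`, a keyword rule that stands in for the locally run LLM so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical answer sets; the real choices used by LEAVS are not given here.
PRESENCE = ["present", "absent"]
URGENCY = ["urgent", "not urgent"]


@dataclass
class Query:
    kind: str            # "filter", "presence", or "urgency"
    organ: str
    finding: str
    text: str
    choices: List[str]   # the multiple-choice options offered to the model


def mock_llm(q: Query) -> str:
    """Toy keyword rule standing in for a locally run LLM (NOT a real model)."""
    t = q.text.lower()
    if q.kind == "filter":
        return "yes" if q.organ in t else "no"
    if q.kind == "presence":
        if q.finding not in t or f"no {q.finding}" in t:
            return "absent"
        return "present"
    return "urgent" if "acute" in t else "not urgent"


def label_organ(report: str, organ: str, findings: List[str],
                ask: Callable[[Query], str]) -> Dict[str, Dict[str, Optional[str]]]:
    # Step 1: sentence filtration. Keep only sentences about this organ.
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    relevant = [s for s in sentences
                if ask(Query("filter", organ, "", s, ["yes", "no"])) == "yes"]
    if not relevant:
        return {f: {"presence": "not mentioned", "urgency": None} for f in findings}

    context = ". ".join(relevant)
    labels: Dict[str, Dict[str, Optional[str]]] = {}
    for f in findings:
        # Step 2: multiple-choice presence question for each finding type.
        presence = ask(Query("presence", organ, f, context, PRESENCE))
        # Step 3: descend to the urgency question only on the "present" branch.
        urgency = None
        if presence == "present":
            urgency = ask(Query("urgency", organ, f, context, URGENCY))
        labels[f] = {"presence": presence, "urgency": urgency}
    return labels


report = ("The liver contains a 2 cm lesion. No lesion in the spleen. "
          "The kidneys are unremarkable.")
liver = label_organ(report, "liver", ["lesion"], mock_llm)
spleen = label_organ(report, "spleen", ["lesion"], mock_llm)
pancreas = label_organ(report, "pancreas", ["lesion"], mock_llm)
print(liver, spleen, pancreas)
```

In a real setting, `ask` would render each `Query` into a prompt and parse the model's chosen option. The point of the tree is that later questions (here, urgency) are only asked when an earlier answer makes them relevant, which keeps each prompt short and each answer constrained to a fixed set of choices.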
