Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

Mohamed Sobhi Jabal, Jikai Zhang, Dominic LaBella, Jessica L. Houk, Dylan Zhang, Jeffrey D. Rudie, Kirti Magudia, Maciej A. Mazurowski, Evan Calabrese

Abstract

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-4 69.3%, BT-2 69.2%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system agreed with the expert reference standard more often than initial clinical scoring did, with high sensitivity for context-dependent categories and high positive predictive value for BT-4 detection.
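The abstract does not name the method behind its 95% confidence intervals. As a worked check, the sketch below computes Wilson score intervals for the two reported accuracy counts; the choice of the Wilson method and the function name `wilson_interval` are assumptions made here, not details taken from the paper. Under that assumption the bounds come out at the reported values when rounded to one decimal place.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, see lead-in)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts taken from the abstract: correct classifications out of 492 examinations.
for label, correct in [("multi-agent system", 374), ("initial clinical assessment", 283)]:
    low, high = wilson_interval(correct, 492)
    print(f"{label}: {correct / 492:.1%} (95% CI, {low:.1%}-{high:.1%})")
```

The two intervals do not overlap, but the reported P value comes from a paired comparison (McNemar test, per the Figure 3 caption), which needs the per-examination discordant counts that are not given in the abstract.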

Paper Structure (16 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of the multi-agent system for automated BT-RADS classification. The pipeline has three components: CNN-based tumor segmentation (nnU-Net) for volumetric quantification of FLAIR and enhancement volumes; an extractor agent, a 20-billion-parameter open-weight LLM that identifies clinical variables (steroid status, bevacizumab use, radiation date) from unstructured clinical notes with evidence span linking; and a scorer agent that applies BT-RADS decision logic integrating the extracted variables with the quantitative volumetric measurements. An orchestrator coordinates data flow between agents. Schema-constrained generation with Pydantic validation ensures outputs conform to predefined formats.
  • Figure 2: BT-RADS decision flowchart implemented by the scoring agent. Sequential rules classify cases based on volumetric thresholds ($\pm$20% stability, $>$40% major change), medication effects (bevacizumab, steroids), and radiation timing ($<$90 days post-completion). Terminal nodes correspond to BT-RADS categories BT-0 through BT-4.
  • Figure 3: Composite classification performance. (A) Overall accuracy: multi-agent system versus initial clinical assessment (McNemar $P < .001$); error bars represent 95% confidence intervals. (B) Per-category sensitivity of the multi-agent system in BT-RADS category order. (C) Multi-agent system confusion matrix. (D) Initial clinical assessment confusion matrix. In C and D, color intensity represents row-normalized sensitivity; cell values are raw counts; diagonal cells (correct classifications) are outlined.
  • Figure 4: Clinical information extraction and BT-RADS classification for three representative de-identified cases. (A) BT-1b: bevacizumab use identified from clinical notes; volumetric improvement routed through the medication effect pathway. (B) BT-2: both FLAIR and enhancement volumes stable within $\pm$20%; no active medications. (C) BT-4: both components worsening (FLAIR $+$231%, enhancement $+$187%) with at least one exceeding 40%; radiation completed more than 90 days prior. Each panel shows the clinical note excerpt with highlighted evidence spans, extracted variables, volumetric data, and the decision pathway leading to the final classification.
  • Figure 5: Per-category classification concordance of the automated system (76.0%; 374/492) and the initial clinical assessment (57.5%; 283/492) with the reference standard for the study population ($n = 492$). Each square represents one examination, colored by BT-RADS category. Solid squares indicate correct classification; faded squares indicate misclassification. Per-category accuracy is listed below (automated system / initial clinical assessment). McNemar $P < .001$.
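
The volumetric thresholds named in the Figure 2 caption and the three pathways illustrated in Figure 4 can be sketched as follows. This is a partial illustration, not the authors' scoring agent: the full decision flowchart is not reproduced on this page, so the function returns `None` for any case outside the three illustrated pathways. All names (`percent_change`, `trend`, `illustrated_pathway`), the treatment of values exactly at a threshold, and the reading of "volumetric improvement" as at least one component improving with none worsening are assumptions.

```python
from typing import Optional

STABLE_BAND = 20.0           # percent, from "±20% stability" (Figure 2 caption)
MAJOR_CHANGE = 40.0          # percent, from ">40% major change" (Figure 2 caption)
RADIATION_WINDOW_DAYS = 90   # from "<90 days post-completion" (Figure 2 caption)


def percent_change(prior_volume: float, current_volume: float) -> float:
    """Relative volume change from the prior examination, in percent."""
    if prior_volume <= 0:
        raise ValueError("prior volume must be positive")
    return (current_volume - prior_volume) / prior_volume * 100.0


def trend(pct: float) -> str:
    """Bucket a percent change; boundary values count as stable (assumption)."""
    if pct > MAJOR_CHANGE:
        return "major_worsening"
    if pct > STABLE_BAND:
        return "worsening"
    if pct < -STABLE_BAND:
        return "improvement"
    return "stable"


def illustrated_pathway(
    flair_pct: float,
    enhancement_pct: float,
    on_bevacizumab: bool,
    on_steroids: bool,
    days_since_radiation: Optional[int],
) -> Optional[str]:
    """Return the category for the three Figure 4 examples, else None."""
    flair, enh = trend(flair_pct), trend(enhancement_pct)
    worse = {"worsening", "major_worsening"}

    # Figure 4A: improvement while on bevacizumab -> medication effect pathway.
    improving = "improvement" in (flair, enh) and flair not in worse and enh not in worse
    if improving and on_bevacizumab:
        return "BT-1b"

    # Figure 4B: both components stable within the band, no active medications.
    if flair == enh == "stable" and not (on_bevacizumab or on_steroids):
        return "BT-2"

    # Figure 4C: both worsening, at least one above 40%, radiation > 90 days ago.
    if (
        flair in worse
        and enh in worse
        and max(flair_pct, enhancement_pct) > MAJOR_CHANGE
        and days_since_radiation is not None
        and days_since_radiation > RADIATION_WINDOW_DAYS
    ):
        return "BT-4"

    return None  # pathway not covered by this sketch


# Figure 4C values: FLAIR +231%, enhancement +187%; 120 days is a made-up input.
print(illustrated_pathway(231.0, 187.0, False, False, 120))
```

Keeping the thresholds as named constants and returning `None` for uncovered cases makes it explicit where this sketch stops and the published flowchart would have to take over.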