VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis

Andrew Hoopes, Neel Dey, Victor Ion Butoi, John V. Guttag, Adrian V. Dalca

TL;DR

VoxelPrompt addresses flexible, end-to-end radiology workflows by jointly training a language-model agent with a vision network to generate and execute analysis pipelines from natural language prompts. The system operates on native-resolution 3D volumes, using cross-volume attention and a persistent execution environment to produce segmentations, measurements, and language explanations across multi-acquisition studies. Key contributions include a unified framework that matches or exceeds single-task specialist baselines on diverse brain-imaging tasks, efficiency gains in runtime and memory from native-resolution processing, and robust performance under varying acquisition types and data quality. The approach offers transparent, programmable workflows that can be integrated into clinical pipelines, enabling broader, open-ended biomedical analyses with AI assistance.

Abstract

We present VoxelPrompt, an end-to-end image analysis agent that tackles free-form radiological tasks. Given any number of volumetric medical images and a natural language prompt, VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. This code further carries out analytical steps to address practical quantitative aims, such as measuring the growth of a tumor across visits. The pipelines generated by VoxelPrompt automate analyses that currently require practitioners to painstakingly combine multiple specialized vision and statistical tools. We evaluate VoxelPrompt using diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt performs these objectives with an accuracy similar to that of specialist single-task models for image analysis, while facilitating a broad range of compositional biomedical workflows.
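
To make the "growth of a tumor across visits" example concrete, the following is a minimal sketch of the kind of analytical step such a generated pipeline might contain. It is not VoxelPrompt's actual generated code: the helper names (`box_mask`, `mask_volume_mm3`, `relative_growth`), the toy masks, and the voxel spacings are all assumptions made for illustration. It shows only the arithmetic: volume is the foreground voxel count times the physical voxel volume, so two visits acquired at different resolutions can still be compared.

```python
from math import prod


def box_mask(shape, box):
    """Toy binary mask (hypothetical helper): a filled box at the origin."""
    return [[[1 if (i < box[0] and j < box[1] and k < box[2]) else 0
              for k in range(shape[2])]
             for j in range(shape[1])]
            for i in range(shape[0])]


def mask_volume_mm3(mask, spacing_mm):
    """Volume of a binary 3D mask: foreground voxel count times voxel volume."""
    voxel_volume = prod(spacing_mm)
    n_foreground = sum(1 for plane in mask for row in plane for v in row if v)
    return n_foreground * voxel_volume


def relative_growth(volume_before, volume_after):
    """Fractional change in volume between two visits."""
    if volume_before <= 0:
        raise ValueError("baseline volume must be positive")
    return (volume_after - volume_before) / volume_before


# Assumed toy data: the two visits use different voxel spacings,
# so raw voxel counts alone would not be comparable.
visit1_mask = box_mask((4, 4, 4), (2, 2, 2))   # 8 foreground voxels
visit2_mask = box_mask((4, 4, 4), (2, 3, 4))   # 24 foreground voxels
vol1 = mask_volume_mm3(visit1_mask, (1.0, 1.0, 2.0))
vol2 = mask_volume_mm3(visit2_mask, (1.0, 1.0, 1.0))
growth = relative_growth(vol1, vol2)
print(f"visit 1: {vol1} mm^3, visit 2: {vol2} mm^3, growth: {growth:.0%}")
```

In a real pipeline the masks would come from a segmentation model and the spacings from the image headers; here both are hard-coded placeholders.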

Paper Structure

This paper contains 27 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustrative examples of VoxelPrompt capabilities, each showing the input prompt (gray) and volumes with VoxelPrompt's predicted annotations and language responses (purple).
  • Figure 2: Top: VoxelPrompt takes a text prompt and volumes as input to a trainable agent $\alpha$. The agent iteratively produces executable code in a Python environment $\Omega$, which controls a jointly-trained vision model $m$. Bottom: To solve an example language-prompted task, the agent $\alpha$ interprets execution outcomes $z$ (blue) to guide subsequent instruction prediction across multiple steps. To perform vision operations, such as volume encoding or generation, $\alpha$ employs vision networks $m_\text{enc}$ and $m_\text{gen}$, which are manipulated by image-specific latent instruction embeddings $\phi$.
  • Figure 3: VoxelPrompt performance. (A) Free-form text prompts, shown below each image, guide VoxelPrompt to perform targeted analysis and delineation of nuanced, context-specific image regions, even in scans with multiple lesions. (B, C) On unseen datasets with diverse brain abnormalities, VoxelPrompt is the only method achieving consistently high-quality results both qualitatively and quantitatively. (D) Compared to longitudinal FreeSurfer, VoxelPrompt achieves the same effect size in distinguishing Alzheimer’s disease from controls with a $10^{5}\times$ faster runtime. (E) VoxelPrompt outperforms the state-of-the-art specialist model (SynthSeg) on whole brain segmentation.
  • Figure 4: Ablations and analyses. (A) A single VoxelPrompt model trained jointly on all tasks matches or exceeds task-specific models for both lesions (left) and anatomy (right). Asterisks indicate statistically significant differences. (B) Our proposed native-resolution convolutions are more efficient in runtime and memory than isotropic resampling. (C) Our attention mechanism for multi-input volume interaction is more robust to image corruptions compared to max and mean reductions.
  • Figure 5: Schematic of the lesion synthesis procedure. A lesion shape is first generated by attenuating and thresholding Brownian noise. The resulting segmentation map is resampled into the target image space, with size and position determined based on anatomical priors. The lesion is in-painted by pasting the tissue mask into the image with procedurally-generated texture and mean signal intensity based on randomly selected relative tissue characteristics.
  • ...and 2 more figures
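
The agent-environment loop described in the Figure 2 caption (an agent that emits code, a persistent Python environment that executes it, and execution outcomes $z$ fed back to the agent) can be sketched as a toy program. Everything below is an assumption for illustration: `scripted_policy` is a hard-coded stand-in for the trained agent $\alpha$, `segment` is a threshold stub standing in for the vision model $m$, and the convention that a step reports its outcome by assigning a variable named `z` is invented here, not taken from the paper.

```python
def segment(volume, threshold=0.5):
    """Stub vision model (assumption): threshold intensities into a binary mask."""
    return [1 if v > threshold else 0 for v in volume]


def scripted_policy(history, outcome):
    """Stand-in for the agent: picks the next code string from the last outcome."""
    step = len(history)
    if step == 0:
        return "seg = segment(volumes['scan'])"
    if step == 1:
        return "z = sum(seg)"  # report the foreground voxel count back
    if step == 2:
        # Branch on the execution outcome of the previous step.
        if outcome:
            return f"answer = 'lesion found ({outcome} voxels)'"
        return "answer = 'no lesion found'"
    return None  # signal that the task is finished


def run_agent(policy, namespace, max_steps=8):
    """Alternate code generation and execution in one persistent namespace."""
    history, outcome = [], None
    for _ in range(max_steps):
        code = policy(history, outcome)
        if code is None:
            break
        try:
            exec(code, namespace)               # state persists across steps
            outcome = namespace.pop("z", None)  # outcome reported by this step
        except Exception as err:
            outcome = f"error: {err!r}"
        history.append((code, outcome))
    return history


# Toy 1D "volume" so the sketch stays dependency-free.
namespace = {"segment": segment, "volumes": {"scan": [0.1, 0.9, 0.8, 0.2, 0.7]}}
history = run_agent(scripted_policy, namespace)
print(namespace["answer"])
```

The point of the sketch is the control flow: variables such as `seg` survive between steps because the namespace is shared, and the third instruction depends on what the second one returned. In the system described by the caption, the policy would be a trained language model and the vision calls would be steered by latent instruction embeddings $\phi$, neither of which is modeled here.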