Intelligent Control of Robotic X-ray Devices using a Language-promptable Digital Twin

Benjamin D. Killeen, Anushri Suresh, Catalina Gomez, Blanca Inigo, Christopher Bailey, Mathias Unberath

TL;DR

This work tackles the challenge of controlling robotic X-ray C-arms via natural language by integrating a language-aligned foundation model with a FluoroSAM-based digital twin. A large language model translates spoken commands into actions and language prompts, which the digital twin uses to perform real-time visualization, patient-specific viewfinding, and automatic 3D collimation from sparse multi-view X-ray data, with segmentation driven by text prompts. In cadaveric and post hoc analyses, the system achieves end-to-end success around 83–84% and localizes targeted structures with a mean 3D centroid error of about $51.68 \pm 30.84$ mm, while maintaining a 3D bounding-box precision of $0.26$ and recall of $0.70$ for tested prompts. The results demonstrate the feasibility of language-driven, intelligent robotic X-ray assistants and suggest that, as foundation models improve, such systems could support versatile, physician-intent-driven intraoperative workflows with potential radiation-exposure benefits.

Abstract

Natural language offers a convenient, flexible interface for controlling robotic C-arm X-ray systems, making advanced functionality and controls accessible. However, enabling language interfaces requires specialized AI models that interpret X-ray images to create a semantic representation for reasoning. The fixed outputs of such AI models limit the functionality of language controls. Incorporating flexible, language-aligned AI models prompted through language enables more versatile interfaces for diverse tasks and procedures. Using a language-aligned foundation model for X-ray image segmentation, our system continually updates a patient digital twin based on sparse reconstructions of desired anatomical structures. This supports autonomous capabilities such as visualization, patient-specific viewfinding, and automatic collimation from novel viewpoints, enabling commands such as 'Focus in on the lower lumbar vertebrae.' In a cadaver study, users visualized, localized, and collimated structures across the torso using verbal commands, achieving 84% end-to-end success. Post hoc analysis of randomly oriented images showed our patient digital twin could localize 35 commonly requested structures to within 51.68 mm, enabling localization and isolation from arbitrary orientations. Our results demonstrate how intelligent robotic X-ray systems can incorporate physicians' expressed intent directly. While existing foundation models for intra-operative X-ray analysis exhibit failure modes, as they improve, they can facilitate highly flexible, intelligent robotic C-arms.

Paper Structure

This paper contains 10 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: We present a natural language interface for commanding robotic X-ray devices using a multi-modal foundation model for X-ray imaging. Our approach, which we demonstrate in a real-time cadaver study (a), uses a large language model to parse the desired action and suitable prompt from the spoken input. The digital twin (b) uses FluoroSAM [killeen2024fluorosam] to segment anatomies based on the prompt, supporting real-time visualization (c), patient-specific viewfinding (d), and 3D collimation.
  • Figure 2: Performance and example masks for text-only prompting with FluoroSAM. Although certain classes struggle without additional point-based prompts, the model correctly localizes many structures based on CLIP [radford2021learning, wang2022medclip] embeddings of natural language prompts, extracted by the LLM protocol, including unseen prompts like "lower lumbar vertebrae."
  • Figure 3: The reconstruction error of the digital twin in terms of localization and collimation of desired structures. We observe a tendency toward better localization and isolation for structures which FluoroSAM segments more easily.
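
The localization and collimation errors summarized above (3D centroid error, 3D bounding-box precision and recall) can be illustrated with a minimal sketch. The exact metric definitions are not given in this summary, so the ones below are assumptions: centroid error is taken as the Euclidean distance between the centroids of the reconstructed and reference point sets, and precision and recall are taken as the intersection volume of axis-aligned boxes divided by the predicted and reference box volume, respectively. The function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def centroid_error(pred_points, true_points):
    """Euclidean distance between two point-set centroids (same units as input, e.g. mm)."""
    pred_c = np.asarray(pred_points, dtype=float).mean(axis=0)
    true_c = np.asarray(true_points, dtype=float).mean(axis=0)
    return float(np.linalg.norm(pred_c - true_c))


def box_volume(box):
    """Volume of an axis-aligned box given as (min_xyz, max_xyz); empty boxes give 0."""
    lo = np.asarray(box[0], dtype=float)
    hi = np.asarray(box[1], dtype=float)
    return float(np.prod(np.clip(hi - lo, 0.0, None)))


def box_precision_recall(pred_box, true_box):
    """Assumed definitions: precision = |P ∩ T| / |P|, recall = |P ∩ T| / |T|."""
    lo = np.maximum(pred_box[0], true_box[0])
    hi = np.minimum(pred_box[1], true_box[1])
    inter = box_volume((lo, hi))
    pred_vol = box_volume(pred_box)
    true_vol = box_volume(true_box)
    precision = inter / pred_vol if pred_vol > 0 else 0.0
    recall = inter / true_vol if true_vol > 0 else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy values only, not data from the study.
    pred_box = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
    true_box = ((1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
    print(box_precision_recall(pred_box, true_box))
    print(centroid_error([[0, 0, 0], [2, 0, 0]], [[4, 4, 0]]))
```

Under these assumed definitions, low precision combined with high recall, as in the reported 0.26 and 0.70, would correspond to a collimation box that covers most of the target structure while also enclosing a good deal of surrounding volume.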