Intelligent Control of Robotic X-ray Devices using a Language-promptable Digital Twin
Benjamin D. Killeen, Anushri Suresh, Catalina Gomez, Blanca Inigo, Christopher Bailey, Mathias Unberath
TL;DR
This work tackles the challenge of controlling robotic X-ray C-arms via natural language by integrating a language-aligned foundation model with a FluoroSAM-based digital twin. A large language model translates spoken commands into actions and language prompts, which the digital twin uses to perform real-time visualization, patient-specific viewfinding, and automatic 3D collimation from sparse multi-view X-ray data, with segmentation driven by text prompts. In cadaveric and post hoc analyses, the system achieves end-to-end success around 83–84% and localizes targeted structures with a mean 3D centroid error of $51.68 \pm 30.84$ mm, with a 3D bounding-box precision of $0.26$ and recall of $0.70$ for the tested prompts. The results demonstrate the feasibility of language-driven, intelligent robotic X-ray assistants and suggest that, as foundation models improve, such systems could support versatile, physician-intent-driven intraoperative workflows with potential radiation-exposure benefits.
Abstract
Natural language offers a convenient, flexible interface for controlling robotic C-arm X-ray systems, making advanced functionality and controls accessible. However, enabling language interfaces requires specialized AI models that interpret X-ray images to create a semantic representation for reasoning. The fixed outputs of such AI models limit the functionality of language controls. Incorporating flexible, language-aligned AI models prompted through language enables more versatile interfaces for diverse tasks and procedures. Using a language-aligned foundation model for X-ray image segmentation, our system continually updates a patient digital twin based on sparse reconstructions of desired anatomical structures. This supports autonomous capabilities such as visualization, patient-specific viewfinding, and automatic collimation from novel viewpoints, enabling commands such as 'Focus in on the lower lumbar vertebrae.' In a cadaver study, users visualized, localized, and collimated structures across the torso using verbal commands, achieving 84% end-to-end success. Post hoc analysis of randomly oriented images showed our patient digital twin could localize 35 commonly requested structures to within 51.68 mm on average, enabling localization and isolation from arbitrary orientations. Our results demonstrate how intelligent robotic X-ray systems can incorporate physicians' expressed intent directly. While existing foundation models for intra-operative X-ray analysis exhibit failure modes, as they improve, they can facilitate highly flexible, intelligent robotic C-arms.
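The localization figures quoted above (3D centroid error, 3D bounding-box precision and recall) can be made concrete with a small sketch. This section does not define how the metrics are computed, so the sketch assumes axis-aligned boxes given as (min corner, max corner) in millimetres and a volume-overlap definition of precision and recall. The function names `centroid_error_mm`, `box_volume`, and `box_precision_recall` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def centroid_error_mm(pred_centroid, true_centroid):
    """Euclidean distance between two 3D points, in the units of the inputs (mm here)."""
    diff = np.asarray(pred_centroid, dtype=float) - np.asarray(true_centroid, dtype=float)
    return float(np.linalg.norm(diff))

def box_volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner); 0 if degenerate."""
    lo = np.asarray(box[0], dtype=float)
    hi = np.asarray(box[1], dtype=float)
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def box_precision_recall(pred_box, true_box):
    """ASSUMED definition: precision = |pred ∩ true| / |pred|, recall = |pred ∩ true| / |true|."""
    # The intersection of two axis-aligned boxes is itself an axis-aligned box.
    lo = np.maximum(pred_box[0], true_box[0])
    hi = np.minimum(pred_box[1], true_box[1])
    inter = box_volume((lo, hi))
    pred_vol = box_volume(pred_box)
    true_vol = box_volume(true_box)
    precision = inter / pred_vol if pred_vol > 0 else 0.0
    recall = inter / true_vol if true_vol > 0 else 0.0
    return precision, recall

# Toy example: the predicted box is twice as large as the true structure and fully contains it.
pred = ((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
true = ((5.0, 0.0, 0.0), (10.0, 10.0, 10.0))
precision, recall = box_precision_recall(pred, true)
```

Under this assumed definition, precision lower than recall means the predicted box is larger than the true structure: it covers much of the target but also includes surrounding volume.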
