Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy

Junyang Wu; Mingyi Luo; Fangfang Xie; Minghui Zhang; Hanxiao Zhang; Chunxi Zhang; Junhao Wang; Jiayuan Sun; Yun Gu; Guang-Zhong Yang

TL;DR

This paper presents a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.

Abstract

Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to that of the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
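The critic step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_next` stands in for the learned world model, images are represented as flat feature vectors, and cosine similarity is an assumed stand-in for whatever matching score the paper uses. All names here are hypothetical.

```python
from typing import Callable, Hashable, Sequence

Vector = Sequence[float]  # assumption: an image is reduced to a flat feature vector


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def critic_select(
    current_view: Vector,
    target_view: Vector,
    candidate_actions: Sequence[Hashable],
    predict_next: Callable[[Vector, Hashable], Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> Hashable:
    """Pick the candidate whose predicted next view best matches the target view.

    `predict_next` is a hypothetical placeholder for the world model: it maps
    (current view, action) to a predicted future view.
    """
    if not candidate_actions:
        raise ValueError("at least one candidate action is required")
    scores = {
        action: similarity(predict_next(current_view, action), target_view)
        for action in candidate_actions
    }
    # max() returns the first best-scoring action, so ties keep candidate order.
    return max(candidate_actions, key=lambda action: scores[action])
```

In this sketch the critic is only consulted for the conflicting candidates, so the cost of rolling out the world model scales with the number of disputed actions rather than the full action space.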

Paper Structure (23 sections, 4 equations, 22 figures, 2 tables)

Introduction
Results
System overview and workflow
Intra-operative multi-agents design
Short-term Reactive Agent Design
Long-term Strategic Agent Design
World Model as Critic
Evaluation of Full Segmental Reach in a High-Fidelity Phantom
Robustness to Visual Perturbations
Ex Vivo Evaluation of Bronchial Navigation
Autonomous navigation in in vivo porcine model
Discussion
Task-Directed Expert Demonstrations
Style-Transfer-Augmented Generation
Capturing and labelling the ex vivo and in vivo datasets
...and 8 more sections

Figures (22)

Figure 1: Conceptual overview of the autonomous robotic navigation framework. Preoperatively, following a patient CT scan, fully automated algorithms are deployed to segment the bronchial tree and target lesions, and to plan the optimal intraoperative trajectory. This planned path is formulated as a sequential series of virtual image targets. Intraoperatively, the intelligent agent autonomously navigates through these consecutive sub-targets to ultimately access the bronchial segment nearest to the lesion. To achieve this, the agent continuously executes a dual-state decision policy: if it determines that the robot has successfully reached the current sub-target, it updates the objective to the subsequent waypoint; otherwise, it generates specific kinematic control commands to actuate the 6-degree-of-freedom (6-DoF) flexible robot toward the current target.
Figure 2: Architecture of the hierarchical multi-agent autonomous navigation framework. The system decomposes the navigation task across two temporal scales: a short-term reactive agent that compensates for immediate endoluminal dynamics, and a long-term strategic agent that acts as a high-level supervisor. Inter-agent coordination is governed by an interaction and consensus mechanism. If the strategic agent’s proposed action falls within the top-K actions ranked by the reactive agent’s predicted logits, consensus is reached and the action is executed. In the event of a conflict, resolution depends on the source of the guidance: if the conflict originates from preoperative guidance, the command is not executed; if it originates from the LLM guidance, a world model serves as a critic that simulates potential downstream outcomes to determine the optimal control signal.
Figure 3: Results of phantom experiments. A. Bronchial segmentation of the experimental phantom displaying 17 planned trajectories (five of which were used for artifact experiments). B. A comparison between artifact-degraded and clean images. C. A successful trajectory in the clean phantom. D. A successful trajectory in the artifact phantom. E. Frequency distribution of the levels reached by different methods. F. Comparison of time and number of actions across different methods on the clean phantom. G. Comparison of time and number of actions across different methods on the artifact phantom. H. Details of the maximum bronchial generation reached by different methods for each of the 17 trajectories. I. Quantitative analysis of the bronchial generations reached by different methods. J. SSIM between the expert's view and the final images obtained by our method and ViNT upon reaching the endpoint. K. SSIM comparison between the final images of our method and the expert's view in both clean and artifact phantom scenarios.
Figure 4: Ex vivo evaluation on diverse porcine lungs. (A) 3D segmentation of three distinct porcine lungs showing morphological variability. (B) Endoscopic views at identical anatomical landmarks across different lungs, highlighting visual domain gaps. (C) System response to specific challenges: navigating through mucus occlusion, autonomous target switching upon visual matching, and adaptive view adjustment to avoid obstructions. (D) Success rates of 59 navigation trajectories across bronchial generations. (E) Distribution of procedure time and action steps. (F) Failure modes caused by lens fouling and complete bubble occlusion.
Figure 5: Results of in vivo experiments. A. Experimental setup for in vivo studies. B. Two examples by the automated system and the human expert. C. End-effector distance between the expert and the automated system. D. Structural Similarity Index Measure (SSIM) of the endpoints between the expert and the automated system. E. Distance to the nodule upon reaching the endpoint across different methods. F. Intraoperative CBCT images displaying preoperative bronchial and nodule segmentations, as well as intraoperative bronchoscope segmentation. G. Quantitative analysis of time and number of actions across different methods. H. Analysis of the lateral adjustment rate and the forward progression rate.
...and 17 more figures
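
The waypoint update in the Figure 1 caption and the top-K consensus check in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the captions only: the function names, the integer action indices, and the default `k=2` are assumptions, since the captions do not state the value of K or the action encoding.

```python
from typing import List, Sequence, Tuple


def top_k_actions(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest logits, highest first (ties keep index order)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def has_consensus(strategic_action: int, reactive_logits: Sequence[float], k: int = 2) -> bool:
    """True if the strategic agent's action is among the reactive agent's top-k actions.

    k=2 is an arbitrary illustrative default, not a value taken from the paper.
    """
    return strategic_action in top_k_actions(reactive_logits, k)


def advance_waypoint(index: int, reached: bool, num_targets: int) -> Tuple[int, bool]:
    """Dual-state policy: move to the next virtual target once the current one is reached.

    Returns (new_index, done). When the target is not reached the index is
    unchanged and the caller is expected to issue a motion command instead.
    """
    if not reached:
        return index, False
    next_index = index + 1
    return next_index, next_index >= num_targets
```

A control loop built on these helpers would call `advance_waypoint` once per frame and fall back to a critic only when `has_consensus` returns `False`.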
