BronchOpt : Vision-Based Pose Optimization with Fine-Tuned Foundation Models for Accurate Bronchoscopy Navigation

Hongchao Shu, Roger D. Soberanis-Mukul, Jiru Xu, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

TL;DR

BronchOpt addresses robust intra-operative bronchoscopy localization under respiratory motion and CT-to-body divergence by combining a modality- and domain-invariant encoder, an iterative pose optimization network, and a differentiable rendering-based refinement. Trained entirely on synthetic data, it performs frame-wise 2D–3D registration between endoscopic views and CT anatomy, reaching an average translation error of $2.65$ mm, an average rotation error of $0.19$ rad, and a $96\%$ success rate on a public synthetic benchmark. The framework generalizes to real patient data without domain-specific adaptation, as shown by qualitative improvements in contour alignment and depth consistency. A public synthetic bronchoscopy benchmark is introduced to standardize evaluation and support reproducible progress in vision-based bronchoscopy navigation.

Abstract

Accurate intra-operative localization of the bronchoscope tip relative to patient anatomy remains challenging due to respiratory motion, anatomical variability, and CT-to-body divergence that cause deformation and misalignment between intra-operative views and pre-operative CT. Existing vision-based methods often fail to generalize across domains and patients, leading to residual alignment errors. This work establishes a generalizable foundation for bronchoscopy navigation through a robust vision-based framework and a new synthetic benchmark dataset that enables standardized and reproducible evaluation. We propose a vision-based pose optimization framework for frame-wise 2D-3D registration between intra-operative endoscopic views and pre-operative CT anatomy. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between real endoscopic RGB frames and CT-rendered depth maps, while a differentiable rendering module iteratively refines camera poses through depth consistency. To enhance reproducibility, we introduce the first public synthetic benchmark dataset for bronchoscopy navigation, addressing the lack of paired CT-endoscopy data. Trained exclusively on synthetic data distinct from the benchmark, our model achieves an average translational error of 2.65 mm and a rotational error of 0.19 rad, demonstrating accurate and stable localization. Qualitative results on real patient data further confirm strong cross-domain generalization, achieving consistent frame-wise 2D-3D alignment without domain-specific adaptation. Overall, the proposed framework achieves robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark provides a foundation for standardized progress in vision-based bronchoscopy navigation.
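The abstract reports errors in millimetres and radians but this excerpt does not define how they are computed. The sketch below assumes the common definitions: Euclidean distance between estimated and ground-truth camera positions, and the geodesic angle between estimated and ground-truth rotation matrices. The function names (`translation_error`, `rotation_error`, `rot_z`) are illustrative and not taken from the paper.

```python
import math

def translation_error(t_est, t_gt):
    """Euclidean distance between two 3D positions (same unit as the inputs, e.g. mm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))

def rotation_error(R_est, R_gt):
    """Geodesic angle in radians between two 3x3 rotation matrices.

    Uses trace(R_est^T R_gt) = 1 + 2*cos(theta); the trace equals the
    element-wise (Frobenius) inner product of the two matrices.
    """
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)

def rot_z(theta):
    """Rotation by theta radians about the z axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Example: a pose that is off by (3, 4, 0) mm and rotated 0.19 rad about z.
t_err = translation_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
r_err = rotation_error(rot_z(0.19), rot_z(0.0))
```

Averaging these two quantities over all benchmark frames would give figures comparable in form to the reported 2.65 mm and 0.19 rad, under the stated assumptions about the metric definitions.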

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The proposed vision-based pose optimization pipeline. (a) The network performs frame-wise 2D–3D registration between a real bronchoscopic RGB image $I$ and a rendered CT-derived depth map $d$. Given the initial camera pose $T_{0}$, the depth features $F_d$ are warped into the space of the image features $F_I$; both feature maps are then fused and processed by a pose head that predicts pose increments $\Delta T^{P}_{i}$, progressively aligning the two views. (b) The optimized pose $T^{P} = T_{0} \prod_{i=1}^{N} \Delta T^{P}_{i}$ is used to render a depth map $d_{0}$ from the CT mesh, while a depth estimator infers a pseudo depth map $d_e$ from the live scene $I$. The rendering loss $L_{Diffrender}(d_e, d_j)$ enforces consistency between the pseudo depth and the rendered depth, refining the pose estimate $T^{R} = T^{P} \prod_{j=1}^{M} \Delta T^{R}_{j}$ in a closed loop for improved registration accuracy. The red arrow denotes the backpropagation path of the differentiable rendering loop.
  • Figure 2: Qualitative alignment and depth error visualization on real patient data. First two columns: Contour overlays illustrate the geometric correspondence between rendered CT views and real bronchoscopic images. Initially, the contours show noticeable misalignment with the real scene, but after applying our optimization, they align closely, indicating improved pose estimation accuracy. Last two columns: depth error maps before (Init) and after optimization (Ours).
  • Figure 3: Example data structure for the synthetic benchmark dataset.
  • Figure 4: Example image pairs for the synthetic benchmark dataset.
  • Figure 5: Statistics for each case.
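The two pose products in the Figure 1 caption, $T^{P} = T_{0} \prod_{i=1}^{N} \Delta T^{P}_{i}$ and $T^{R} = T^{P} \prod_{j=1}^{M} \Delta T^{R}_{j}$, can be illustrated with a minimal sketch. It assumes poses are 4x4 homogeneous matrices and that increments are applied by right-multiplication in index order, as the caption's notation suggests; the helper names (`matmul4`, `make_pose`, `compose`) and the numeric increments are hypothetical, and in the actual method the increments would come from the pose head and the differentiable rendering loop.

```python
import math

def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_pose(theta_z, t):
    """Homogeneous pose: rotation by theta_z about z, then translation t (illustrative)."""
    c, s = math.cos(theta_z), math.sin(theta_z)
    return [[c, -s, 0.0, t[0]],
            [s,  c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def compose(T_start, increments):
    """Return T_start * dT_1 * dT_2 * ... (right-multiplication in order)."""
    T = T_start
    for dT in increments:
        T = matmul4(T, dT)
    return T

# Initial pose T_0 and made-up increments standing in for network outputs.
T0 = make_pose(0.0, (1.0, 0.0, 0.0))
pose_head_increments = [make_pose(0.1, (0.0, 0.0, 1.0)),
                        make_pose(0.2, (0.0, 0.0, 2.0))]
render_increments = [make_pose(-0.05, (0.0, 0.0, 0.5))]

T_P = compose(T0, pose_head_increments)   # stage (a): pose-head optimization
T_R = compose(T_P, render_increments)     # stage (b): rendering-based refinement
```

Because each stage only appends factors to the running product, the refinement stage starts exactly where the pose-head stage ends, which is what makes the two-stage scheme a single chained estimate rather than two independent ones.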