Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators

Junhyun Park, Chunggil An, Myeongbo Park, Ihsan Ullah, Sihyeong Park, Minho Hwang

TL;DR

The paper addresses markerless 6D pose estimation and closed-loop control of continuum manipulators in endoscopic surgery. It introduces a physics-grounded, photo-realistic synthetic data pipeline, a stereo-aware multi-feature fusion network, and a feed-forward rendering-based refinement module that together produce real-time, geometrically consistent pose estimates for position-based visual servoing (PBVS). A self-supervised sim-to-real adaptation step that uses pseudo ground truth improves real-world accuracy; in real-world validation the estimator reaches a mean translation error of 0.83 mm and a mean rotation error of 2.76° over 1,000 samples, which the paper reports as a substantial reduction relative to prior methods. Markerless PBVS driven by these estimates performs competitively with marker-based systems while reducing hardware complexity and easing deployment in surgical settings.

Abstract

Continuum manipulators in flexible endoscopic surgical systems offer high dexterity for minimally invasive procedures; however, accurate pose estimation and closed-loop control remain challenging due to hysteresis, compliance, and limited distal sensing. Vision-based approaches reduce hardware complexity but are often constrained by limited geometric observability and high computational overhead, restricting real-time closed-loop applicability. This paper presents a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of continuum manipulators. A photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations. A stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability. To enforce geometric consistency without iterative optimization, a feed-forward rendering-based refinement module predicts residual pose corrections in a single pass. A self-supervised sim-to-real adaptation strategy further improves real-world performance using unlabeled data. Extensive real-world validation achieves a mean translation error of 0.83 mm and a mean rotation error of 2.76° across 1,000 samples. Markerless closed-loop visual servoing driven by the estimated pose attains accurate trajectory tracking with a mean translation error of 2.07 mm and a mean rotation error of 7.41°, corresponding to 85% and 59% reductions compared to open-loop control, together with high repeatability in repeated point-reaching tasks. To the best of our knowledge, this work presents the first fully markerless pose-estimation-driven position-based visual servoing framework for continuum manipulators, enabling precise closed-loop control without physical markers or embedded sensing.
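The translation and rotation errors quoted above are presumably per-sample distances between an estimated and a reference pose, averaged over the test set. The abstract does not spell out the exact definitions, so the following is only a minimal sketch under a common convention: Euclidean distance for translation and the geodesic angle of the relative rotation for orientation. The function names and the `rot_z` helper are hypothetical and are not taken from the paper.

```python
import math

def translation_error(t_est, t_gt):
    """Euclidean distance between two 3D positions (same unit as the inputs, e.g. mm)."""
    return math.dist(t_est, t_gt)

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_est^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Toy poses, not data from the paper.
print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # 5.0
print(round(rotation_error_deg(rot_z(0.0), rot_z(10.0)), 3))
```

Averaging these two quantities over all evaluated samples would give figures comparable in kind to the mean errors reported in the abstract, assuming the paper uses the same definitions.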

Paper Structure (49 sections, 57 equations, 13 figures, 7 tables, 1 algorithm)

Figures (13)

  • Figure 1: Pseudo-rigid-body (PRB) kinematic model of the two-segment continuum manipulator. The system consists of two flexible segments, a rigid connector, and a dual-jaw gripper, where each flexible segment is discretized into revolute joints and rigid links. The first segment enables pitch-yaw bending, the second provides pitch bending, and the tool center point (TCP) is defined by the jaw geometry with associated yaw variable $r$.
  • Figure 2: URDF-based modeling and simulation pipeline for synthetic data generation. (Left) The continuum manipulator is discretized into a pseudo-rigid-body URDF comprising base rotation, bending segments, and TCP components. (Right) The model is imported into NVIDIA Isaac Sim with a calibrated stereo camera setup, enabling synchronized stereo rendering with automatic ground-truth pose annotation.
  • Figure 3: Ground-truth annotation configuration. (Left, top) Keypoint and mask definitions for the marker-free configuration: 65 keypoints (28 per jaw, 9 on hinge) and four-class segmentation masks. (Left, bottom) Marker-based configuration used for real-world ground-truth acquisition. (Right) Example synthetic stereo pairs with automatically generated keypoint coordinates and 6D pose labels.
  • Figure 4: Four representative examples of domain-randomized synthetic data. In each sample, background texture, backbone diffuse color, marker appearance, and jaw metallic properties are jointly randomized to maximize visual diversity across the training set.
  • Figure 5: Overview of the proposed stereo-aware pose estimation framework. (I) ROI Extraction: A YOLO detector localizes the manipulator in left and right stereo images. (II) Multi-Feature Fusion: A shared ResNet-50 encoder extracts keypoints, heatmaps, segmentation masks, and refined bounding boxes, which are embedded and concatenated for pose regression. (III) Stereo-Aware Attention: Cross-view features are aggregated via multi-head attention, using the left view as the query and the right view as key-value pairs. (IV) Rendering-Based Refinement: The initial pose is rendered, and geometric discrepancies between rendered and observed features are used to predict residual pose corrections in a single feed-forward pass.
  • ...and 8 more figures
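
The stereo-aware attention step in the Figure 5 caption (left view as query, right view as key-value pairs) can be illustrated with a small sketch. This is not the paper's implementation: it assumes a single attention head, omits the learned query, key, and value projections that a real multi-head layer would have, and treats each view as a short list of feature vectors. The names `cross_view_attention`, `left_tokens`, and `right_tokens` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_view_attention(left_tokens, right_tokens):
    """Scaled dot-product attention: left-view queries attend to right-view keys/values.

    Assumption: no learned projections, so right-view features serve as both
    keys and values. Returns one fused vector per left-view token.
    """
    dim = len(left_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in left_tokens:
        # Similarity of this left-view feature to every right-view feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in right_tokens]
        weights = softmax(scores)
        # Weighted average of the right-view features.
        fused.append([
            sum(w * value[j] for w, value in zip(weights, right_tokens))
            for j in range(dim)
        ])
    return fused

# Toy features, not data from the paper.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_view_attention(left, right)
print(len(fused), len(fused[0]))  # 2 2
```

Each output row is a convex combination of right-view features, weighted by how well they match the corresponding left-view feature, which is the basic mechanism by which a left-view query can draw on right-view evidence.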