Vision-Based Online Key Point Estimation of Deformable Robots

Hehui Zheng, Sebastian Pinzello, Barnabas Gavin Cangan, Thomas Buchner, Robert K. Katzschmann

TL;DR

This work tackles the challenge of estimating the shape of highly deformable soft robots, which have infinite degrees of freedom, by introducing VOKE, a two-view, vision-based CNN regression framework. From paired grayscale images, VOKE outputs either a set of key points or a piecewise constant curvature (PCC) model, enabling online, marker-less 3D shape estimation without prior shape knowledge. Across the WaxCast arm, the SoPrA arm, and a soft robotic fish, VOKE demonstrates competitive accuracy, robustness to lighting and noise, and real-time performance, outperforming existing marker-less baselines by up to 4.5% in tip estimation error. The approach lays groundwork for closed-loop control of soft robots by providing reliable, online shape estimates with a calibration-free camera alignment strategy.

Abstract

The precise control of soft and continuum robots requires knowledge of their shape, which has, in contrast to classical rigid robots, infinite degrees of freedom. To partially reconstruct the shape, proprioceptive techniques use built-in sensors resulting in inaccurate results and increased fabrication complexity. Exteroceptive methods so far rely on expensive tracking systems with reflective markers placed on all components, which are infeasible for deformable robots interacting with the environment due to marker occlusion and damage. Here, a regression approach is presented for 3D key point estimation using a convolutional neural network. The proposed approach takes advantage of data-driven supervised learning and is capable of online marker-less estimation during inference. Two images of a robotic system are taken simultaneously at 25 Hz from two different perspectives, and are fed to the network, which returns for each pair the parameterized key point or PCC shape representations. The proposed approach outperforms marker-less state-of-the-art methods by a maximum of 4.5% in estimation accuracy while at the same time being more robust and requiring no prior knowledge of the shape. Online evaluations on two types of soft robotic arms and a soft robotic fish demonstrate our method's accuracy and versatility on highly deformable systems.
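
To make the PCC shape representation mentioned in the abstract concrete, the following is a minimal sketch of how one constant-curvature section maps to 3D points. It uses the common textbook parameterization (curvature, bending-plane angle, arc length); the function names, argument names, and frame convention are illustrative assumptions and are not taken from the paper, which may parameterize its sections differently.

```python
import math


def pcc_section_tip(kappa, phi, length):
    """Tip position (x, y, z) of one constant-curvature section.

    Assumed convention (not from the paper): the section starts at the
    origin, points along +z when straight, and bends in the plane that is
    rotated by `phi` about z. `kappa` is curvature in 1/m, `length` in m.
    """
    if abs(kappa) < 1e-9:
        # Straight limit: the arc degenerates to a line along z.
        return (0.0, 0.0, length)
    theta = kappa * length  # total bending angle of the section
    radial = (1.0 - math.cos(theta)) / kappa  # offset within the bending plane
    return (radial * math.cos(phi), radial * math.sin(phi), math.sin(theta) / kappa)


def pcc_section_points(kappa, phi, length, n):
    """Sample n + 1 equally spaced points along the section, base to tip."""
    return [pcc_section_tip(kappa, phi, length * i / n) for i in range(n + 1)]
```

Under these assumptions, a network that regresses `(kappa, phi)` per section describes the whole backbone with a handful of numbers, whereas the point-based output regresses each key point position directly.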

Paper Structure (20 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Diagram of the marker-less online inference pipeline for the proposed key point estimation approach VOKE. A 3D shape is captured by two RGB video cameras. Image pairs are preprocessed and run through a convolutional neural network to estimate a shape model for the 3D shape.
  • Figure 2: Truncated VGG Network Architecture: VGG-s-bn. Inputs are the preprocessed binary images from both cameras, and output sizes depend on the selected shape representation and robot (Table 1).
  • Figure 3: Soft robots used for performance evaluation, with evaluation point positions illustrated. In each panel, the left image is the original RGB image and the right image is the preprocessed image. (a) WaxCast arm [katzschmann2019dynamic]; (b) WaxCast arm with visual features (black stripes) [katzschmann2019dynamic]; (c) SoPrA arm [toshimitsu2021sopra]; (d) soft fish [zhang2022creation].
  • Figure 4: Estimation results of VOKE compared to ground truth positions. The experiments in (a) and (b) employ the piecewise constant curvature (PCC) model, while the experiments in (c) to (e) estimate the positions of characteristic points separately. The number of sections considered in each experiment is shown in the figure. The red dots mark the ground truth positions obtained by the motion capture system, and the blue dots mark the positions estimated by VOKE.
  • Figure 5: Performance of VGG13 on SoPrA under different experimental setups, with sample grayscale images and processed masks shown. (a) Tip estimation errors with varying image brightness. The brightness modification is quantified by the value (0–255) added to or subtracted from the pixel values of the original grayscale images. (b) Tip estimation errors with increasing Gaussian noise. Noise with a standard deviation from 0 to 50 is added per pixel to the pixel values of the original grayscale images. (c) Marker 1 and marker 2 (tip) estimation errors plotted for black strip occlusions of varying width at the marker positions.
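
The brightness and noise perturbations described for Figure 5 can be sketched as follows. This is an illustrative reconstruction, assuming 8-bit grayscale images stored as NumPy arrays and assuming the perturbed values are clipped back to the 0–255 range; the function names and the clipping step are assumptions, not details confirmed by the paper.

```python
import numpy as np


def shift_brightness(gray, offset):
    """Add a signed offset to every pixel, then clip to the 8-bit range."""
    # Widen to int16 first so the addition cannot wrap around in uint8.
    shifted = gray.astype(np.int16) + int(offset)
    return np.clip(shifted, 0, 255).astype(np.uint8)


def add_gaussian_noise(gray, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma` per pixel."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = gray.astype(np.float64) + rng.normal(0.0, sigma, size=gray.shape)
    # Round to the nearest integer and clip before converting back to uint8.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)
```

In a robustness test of this kind, each perturbed image pair would be passed through the same preprocessing and network as the unperturbed pair, and the tip error would be compared across offsets or noise levels.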