A three-dimensional force estimation method for the cable-driven soft robot based on monocular images

Xiaohan Zhu, Ran Bu, Zhen Li, Fan Xu, Hesheng Wang

TL;DR

Estimating the 3D interaction forces of cable-driven soft robots in real time is difficult with traditional sensors or 2D proxies. The authors present an end-to-end network that fuses monocular RGB images with PWM actuation signals through a 2D-3D feature fusion module, a unified feature representation module, and an LSTM-based time-series decoder. Key contributions include bridging the image–force dimensional gap via depth estimation and segmentation, learning PWM-conditioned feature representations with cross-attention, and using temporal context to mitigate hysteresis. The approach achieves marker-free, real-time 3D tip-force estimation on a four-cable soft robot, which the authors position as a step toward safer and more capable interactive manipulation.
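
To illustrate why temporal context helps with hysteresis, here is a minimal sketch of building fixed-length input windows for a sequence decoder. The function name `sliding_windows`, the window length, and the scalar actuation values are hypothetical and only for illustration. They are not the authors' preprocessing.

```python
def sliding_windows(features, window):
    """Return, for each time step t >= window - 1, the `window` most recent features."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [features[t - window + 1 : t + 1] for t in range(window - 1, len(features))]

# Hypothetical load/unload cycle of a single actuation value (made-up numbers).
actuation = [0.0, 1.0, 2.0, 1.0, 0.0]
windows = sliding_windows(actuation, window=2)

# The windows ending at t=1 (loading) and t=3 (unloading) share the same
# latest value but differ in history, so a sequence model can tell them apart.
loading, unloading = windows[0], windows[2]
```

A single-frame estimator sees the same input at both instants, whereas a windowed input also carries the direction of motion.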

Abstract

Soft manipulators are well suited to interaction tasks with high safety demands, e.g., robot-assisted surgery and elderly care. Yet the difficulty of obtaining real-time contact feedback has hindered further applications in precise manipulation. This paper proposes an end-to-end network to estimate the 3D contact force of the soft robot, with the aim of enhancing its capabilities in interactive tasks. The presented method directly uses monocular images fused with multidimensional actuation information as the network inputs. This simplifies the preprocessing of raw data compared to related studies that use 3D shape information as network inputs, and consequently reduces configuration reconstruction errors. The unified feature representation module is devised to elevate low-dimensional features from the system's actuation signals to the same level as image features, facilitating smoother integration of multimodal information. The proposed method has been experimentally validated on the soft robot testbed, achieving a mean relative error of 0.84% in 3D force estimation, compared to the best-reported result of 2.2% in related works.
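
The idea of lifting the low-dimensional actuation signal to the image feature dimension and then fusing the two with cross-attention can be sketched in plain Python. This is a toy under stated assumptions, not the authors' network. The projection matrix `W_pwm`, the feature size of 3, the choice of the PWM feature as the attention query, and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, weight):
    """Linear map; `weight` is a list of rows, one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def cross_attention(query, tokens):
    """Single-head scaled dot-product attention of one query over several tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    fused = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return fused, weights

# Hypothetical PWM duty cycles, one per cable, lifted from 4 to 3 dimensions.
pwm = [0.2, 0.8, 0.5, 0.1]
W_pwm = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
feature_pwm = project(pwm, W_pwm)

# Stand-ins for flattened RGB, depth, and segmentation features (made up).
feature_image = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]]

# Assumption: the PWM feature queries the image features; phi is the fused vector.
phi, attn = cross_attention(feature_pwm, feature_image)
```

In a real model the projection and attention weights would be learned, and the image features would come from CNN encoders rather than fixed vectors.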

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Inputs of the proposed end-to-end network are monocular image sequences and actuation signals. (b) Schematic diagram of soft robot 3D tip force estimation. (c) The main process of the proposed method.
  • Figure 2: Framework of the proposed end-to-end tip force estimation network. Feature Extractor: Depth and segmentation images are generated from the input RGB image using pre-trained DPT and FCN models. All three images pass through CNNs and are then flattened into image features. Unified Feature Representation: A feature extractor is pre-trained to produce Feature-$pwm$ from the PWM signal. Cross Attention: Relates Feature-$pwm$ to Feature-$image$ and fuses them into the feature vector $\Phi$. LSTM Block: The force $F$ at time $t$ is estimated by the LSTM network from the time series of feature vectors.
  • Figure 3: (a) The experimental setup for data collection. The force sensor (1) is fixed on the back (outlined) of the contact plane (2). A cable-driven soft robot (3) contacts the plane with its tip, driven by four motors (4). An external camera (5) captures monocular images. All data are saved on an industrial computer (6). (b) Dataset collected for tip force estimation. From top to bottom: monocular, depth, and segmentation images.
  • Figure 4: A comparison of all proposed approaches. Left: The root mean square error (RMSE) of the force estimation for each axis ($x$, $y$, and $z$) and for the resultant force $|F_c|$. Right: The mean relative error (MRE) of the force estimation for each axis ($x$, $y$, and $z$) and for the resultant force $|F_c|$.
  • Figure 5: A continuous visualization of force predictions. The first three columns show the discrepancies between the estimated and true values for each axis ($x$, $y$, and $z$), while the last column shows the resultant force. TMCAlexPWMNet performs better in all cases, particularly at the moments when the soft robot makes contact with and detaches from the target.
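
The per-axis and resultant-force metrics named in the Figure 4 caption can be computed along the following lines. This summary does not state how the paper normalizes the MRE, so the sketch assumes the mean absolute error divided by the peak absolute measured value on each channel. The function names and the sample forces are hypothetical.

```python
import math

def resultant(force):
    """Euclidean norm |F_c| of a 3D force vector (fx, fy, fz)."""
    return math.sqrt(sum(c * c for c in force))

def rmse(estimated, measured):
    """Root mean square error between two equal-length sequences."""
    n = len(measured)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)

def mre(estimated, measured):
    """Assumed definition: mean absolute error over the peak absolute measured value.

    Raises ZeroDivisionError if every measured value on the channel is zero.
    """
    n = len(measured)
    peak = max(abs(m) for m in measured)
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / (n * peak)

def evaluate(est_forces, true_forces):
    """Return {channel: (rmse, mre)} for the x, y, z axes and the resultant force."""
    report = {}
    for i, axis in enumerate("xyz"):
        est = [f[i] for f in est_forces]
        true = [f[i] for f in true_forces]
        report[axis] = (rmse(est, true), mre(est, true))
    est_mag = [resultant(f) for f in est_forces]
    true_mag = [resultant(f) for f in true_forces]
    report["|F_c|"] = (rmse(est_mag, true_mag), mre(est_mag, true_mag))
    return report

# Made-up samples: only the z component of the second sample is mis-estimated.
true_forces = [(0.0, 0.0, 0.0), (3.0, 0.0, 4.0), (0.0, 2.0, 0.0)]
est_forces = [(0.0, 0.0, 0.0), (3.0, 0.0, 3.0), (0.0, 2.0, 0.0)]
report = evaluate(est_forces, true_forces)
```

Normalizing by the peak value rather than by each sample avoids dividing by zero during the frames in which the robot is not in contact with the plane.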