Reward Learning from Suboptimal Demonstrations with Applications in Surgical Electrocautery

Zohre Karimi, Shing-Hei Ho, Bao Thach, Alan Kuntz, Daniel S. Brown

TL;DR

This work tackles learning robotic surgical policies from suboptimal demonstrations under partial observability. It learns a reward function offline from pairwise human preferences over demonstrations, then optimizes a policy against that reward with reinforcement learning. A point-cloud autoencoder provides a compact latent representation of the partial observations; a learned reward $R_\theta$ scores each encoded observation, and a trajectory's return is $J_\theta(\tau)=\sum_{o\in\tau} R_\theta(o)$. The approach is validated in simulation on two electrocautery-like tasks, reaching up to 80% task success, and demonstrated on a real ex vivo bovine tissue setup, succeeding in five of seven trials. Overall, the method reduces the need for near-optimal demonstrations and supports learning from qualitative human feedback in high-dimensional observation spaces.
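
To make the preference objective concrete, here is a minimal sketch of a pairwise preference loss built on the trajectory return $J_\theta(\tau)$ above. It assumes a Bradley-Terry (softmax) preference model, which is a common choice for this kind of reward learning but is not confirmed by this summary. The names `trajectory_return`, `preference_loss`, and `make_linear_reward`, and the toy scalar observations, are hypothetical and stand in for encoded point-cloud latents.

```python
import math


def trajectory_return(reward_fn, trajectory):
    """J(tau): sum of per-observation rewards over one trajectory."""
    return sum(reward_fn(o) for o in trajectory)


def preference_loss(reward_fn, preferences):
    """Average negative log-likelihood of pairwise preferences.

    preferences: list of (worse, better) trajectory pairs.
    Assumption: Bradley-Terry model,
        P(better preferred) = exp(J_b) / (exp(J_w) + exp(J_b)).
    """
    total = 0.0
    for worse, better in preferences:
        j_w = trajectory_return(reward_fn, worse)
        j_b = trajectory_return(reward_fn, better)
        # Numerically stable log(exp(j_w) + exp(j_b)) via log-sum-exp.
        m = max(j_w, j_b)
        log_z = m + math.log(math.exp(j_w - m) + math.exp(j_b - m))
        total += log_z - j_b  # equals -log P(better preferred)
    return total / len(preferences)


def make_linear_reward(theta):
    """Hypothetical toy reward R(o) = theta * o on scalar observations."""
    return lambda o: theta * o


# Toy data: the "better" demonstration has larger observation values.
worse = [0.1, 0.2, 0.1]
better = [0.5, 0.6, 0.7]
prefs = [(worse, better)]

loss_agree = preference_loss(make_linear_reward(1.0), prefs)
loss_disagree = preference_loss(make_linear_reward(-1.0), prefs)
loss_flat = preference_loss(make_linear_reward(0.0), prefs)
```

In this toy setup a reward that ranks the pair correctly (`theta = 1.0`) gets a lower loss than one that ranks it backwards (`theta = -1.0`), and a constant reward gives a loss of $\log 2$, the value for a coin-flip prediction. Minimizing this loss over $\theta$ is equivalent to maximizing the likelihood of the preferences under the assumed model.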

Abstract

Automating robotic surgery via learning from demonstration (LfD) techniques is extremely challenging. This is because surgical tasks often involve sequential decision-making processes with complex interactions of physical objects and have low tolerance for mistakes. Prior works assume that all demonstrations are fully observable and optimal, which might not be practical in the real world. This paper introduces a sample-efficient method that learns a robust reward function from a limited amount of ranked suboptimal demonstrations consisting of partial-view point cloud observations. The method then learns a policy by optimizing the learned reward function using reinforcement learning (RL). We show that using a learned reward function to obtain a policy is more robust than pure imitation learning. We apply our approach on a physical surgical electrocautery task and demonstrate that our method can perform well even when the provided demonstrations are suboptimal and the observations are high-dimensional point clouds. Code and videos available here: https://sites.google.com/view/lfdinelectrocautery

Paper Structure (15 sections, 5 equations, 8 figures, 2 tables, 1 algorithm)

Figures (8)

  • Figure 1: Our proposed method first learns a latent feature representation by pre-training an autoencoder to reconstruct partial-view point clouds. Then, given pairwise preferences over demonstrations with observations encoded by the latent feature representation, our method learns a reward function that maximizes the likelihood of the pairwise preferences.
  • Figure 2: Our autoencoder takes in the green point cloud and outputs the red reconstructed point cloud. * denotes (ReLU $\circ$ group norm $\circ$ 1D convolution), FC denotes (ReLU $\circ$ linear layer), and the tuples denote the shape of the input to each layer. Convolution and group norm are done along the second dimension of the input. Max pooling is done along the first dimension of the input.
  • Figure 3: Experimental setups with the dVRK surgical robot in the Isaac Gym simulator.
  • Figure 4: Visualization: given a specified number of attachment point(s) in the scene, end-effector x-y positions (red points) are sampled on the same horizontal plane of the attachment point(s). The z-value of each red point is the predicted reward given the partial point cloud observation and the coordinates of the corresponding end-effector. Brighter color means higher predicted reward.
  • Figure 5: Learning curves of RL with different action spaces: (blue) end-effector position control, (green) end-effector velocity control, and (orange) joint velocity control. End-effector position control achieves the highest success rate.
  • ...and 3 more figures
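
The max pooling step in the Figure 2 caption can be illustrated with a small sketch. It shows a layer shared across points followed by ReLU and max pooling along the first (point) dimension, which makes the resulting latent vector independent of the order of the points. This is a simplified illustration, not the paper's architecture: group norm is omitted for brevity, the 1D convolution is assumed to have kernel size 1 (so it acts as the same linear map on every point), and the names `shared_linear` and `encode` and all weights are hypothetical.

```python
def shared_linear(points, weights, bias):
    """Apply one linear map to every point (assumed kernel-size-1 1D conv)."""
    return [
        [sum(w * x for w, x in zip(row, p)) + b for row, b in zip(weights, bias)]
        for p in points
    ]


def relu(vector):
    """Elementwise ReLU."""
    return [max(0.0, x) for x in vector]


def encode(points, weights, bias):
    """Map an (N, 3) point cloud to a single feature vector.

    Per-point features have shape (N, C); max pooling along the first
    (point) dimension leaves one value per channel.
    """
    features = [relu(f) for f in shared_linear(points, weights, bias)]
    return [max(channel) for channel in zip(*features)]


# Hypothetical toy inputs: 3 points, 4 output channels.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]
bias = [0.0, 0.0, 0.0, 0.5]

latent = encode(points, weights, bias)
latent_reordered = encode(points[::-1], weights, bias)
```

Here `latent` and `latent_reordered` are identical, since the maximum over points does not depend on their order. That property is useful for point clouds, which have no canonical point ordering.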