Toward Zero-Shot User Intent Recognition in Shared Autonomy

Atharv Belsare, Zohre Karimi, Connor Mattson, Daniel S. Brown

TL;DR

This work tackles the challenge of shared autonomy when user intents are unknown by introducing Vision-Only Shared Autonomy (VOSA), a zero-shot, vision-based framework that infers manipulation intents from a single end-effector RGBD camera and arbitrates between human and robot control without demonstrations. VOSA combines a perception pipeline (YOLOv5-guided clustering), a prediction module (confidence-weighted intent scoring), and linear blending arbitration to adapt in real time. In a Kinova Gen3 tabletop manipulation study with three tasks, VOSA matches oracle-baseline performance and outperforms direct teleoperation, particularly when intents are unknown or dynamic, while remaining preferred by users in challenging scenarios. The results demonstrate the practicality of zero-shot, vision-driven shared control for flexible and efficient human-robot collaboration in unstructured environments.

Abstract

A fundamental challenge of shared autonomy is to use high-DoF robots to assist, rather than hinder, humans by first inferring user intent and then empowering the user to achieve their intent. Although successful, prior methods either rely heavily on a priori knowledge of all possible human intents or require many demonstrations and interactions with the human to learn these intents before being able to assist the user. We propose and study a zero-shot, vision-only shared autonomy (VOSA) framework designed to allow robots to use end-effector vision to estimate zero-shot human intents in conjunction with blended control to help humans accomplish manipulation tasks with unknown and dynamically changing object locations. To demonstrate the effectiveness of our VOSA framework, we instantiate a simple version of VOSA on a Kinova Gen3 manipulator and evaluate our system by conducting a user study on three tabletop manipulation tasks. The performance of VOSA matches that of an oracle baseline model that receives privileged knowledge of possible human intents while also requiring significantly less effort than unassisted teleoperation. In more realistic settings, where the set of possible human intents is fully or partially unknown, we demonstrate that VOSA requires less human effort and time than baseline approaches while being preferred by a majority of the participants. Our results demonstrate the efficacy and efficiency of using off-the-shelf vision algorithms to enable flexible and beneficial shared control of a robot manipulator. Code and videos available here: https://sites.google.com/view/zeroshot-sharedautonomy/home.
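
The blended control mentioned above can be illustrated with a minimal sketch of linear blending, in which the executed command is a convex combination of the human and robot commands. This is not the authors' implementation: the function names, the piecewise-linear shape of the confidence-to-weight map, and the values of `lo`, `hi`, and `alpha_max` are all assumptions chosen for illustration. The only property taken from the paper is that the robot's weight grows with its confidence in the inferred intent.

```python
import numpy as np


def arbitration_weight(confidence, lo=0.3, hi=0.8, alpha_max=0.7):
    """Map a confidence in [0, 1] to the robot's blending weight alpha.

    Hypothetical piecewise-linear map: alpha is 0 below `lo`, rises
    linearly between `lo` and `hi`, and saturates at `alpha_max`.
    All three parameters are illustrative assumptions.
    """
    scaled = (confidence - lo) / (hi - lo)
    return alpha_max * float(np.clip(scaled, 0.0, 1.0))


def blend(u_human, u_robot, alpha):
    """Linear blending: (1 - alpha) * human command + alpha * robot command."""
    u_human = np.asarray(u_human, dtype=float)
    u_robot = np.asarray(u_robot, dtype=float)
    return (1.0 - alpha) * u_human + alpha * u_robot


# Example: the human pushes along x while the robot suggests moving along y.
alpha = arbitration_weight(confidence=0.55)
command = blend([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], alpha)
```

Capping `alpha_max` below 1 in this sketch keeps the human able to override the robot even at full confidence; whether the actual system does this is not stated in the text above.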

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Vision-Only Shared Autonomy (VOSA) combines the benefits of shared autonomy and robot perception to generalize out-of-the-box to new scenes by dynamically perceiving all possible intents, predicting the human's desired intent, and arbitrating the human and robot control actions.
  • Figure 2: A Simple Instantiation of VOSA Perception. (a) The 2D raw RGB scene from the perspective of a camera mounted at the robot's end-effector. (b) The 3D RGB + Depth (RGBD) scene as a point cloud in simulation. (c) Point cloud preprocessing filters out the table and background. (d) $k$-means clustering assigns each point to an object; the value of $k$ is obtained from a pretrained YOLOv5 model [redmon2016look]. (e) Centroids of the point cloud clusters represent possible human intents.
  • Figure 3: User Study Tasks. (a) Pick and Place task where users were asked to relocate two objects in the scene to indicated placement intents (wood pedestals). (b) Deceptive Grasping task where the robot must reach in between two other objects to grasp the target object. (c) Shelving task where the robot helps a human teammate stock a shelf with sports drinks.
  • Figure 4: The arbitration function increases the influence of the robot's command ($\alpha$) as the robot's confidence $c(t)$ in inferring the human's true intent increases.
  • Figure 5: Quantitative Results. (a) Task Completion Time and (b) Input Magnitude for a user study of 18 total users. Subjects were exposed to three control paradigms: direct teleoperation, a shared autonomy baseline with known intents (SAG), and Vision-Only Shared Autonomy (VOSA). Note that if SAG is instantiated without oracle goal information, it functionally reduces to direct teleoperation, whereas VOSA is able to adapt on-the-fly and infer new intents at runtime. In cases where direct teleoperation is burdensome (pick and place, shelving) and cases where intents are not correctly specified prior to the task (shelving, deceptive grasping), VOSA provides a zero-shot assistance paradigm that reduces both burden and intent uncertainty.
  • ...and 1 more figures
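
The perception steps in the Figure 2 caption (filter the point cloud, cluster with $k$-means, take centroids as candidate intents) can be sketched roughly as follows. This is a NumPy-only illustration under stated assumptions, not the paper's code: points are assumed to be already expressed in a frame with z pointing up from the table, the filter thresholds are invented, and `num_detections` is a hypothetical stand-in for the object count that the paper obtains from YOLOv5. The farthest-point initialization is a choice made here for determinism.

```python
import numpy as np


def filter_points(points, table_z=0.0, margin=0.01, max_range=1.5):
    """Drop table and background points (thresholds are assumptions).

    Keeps points more than `margin` above the assumed table height and
    closer than `max_range` to the frame origin.
    """
    points = np.asarray(points, dtype=float)
    above_table = points[:, 2] > table_z + margin
    in_range = np.linalg.norm(points, axis=1) < max_range
    return points[above_table & in_range]


def kmeans_centroids(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen centroid.
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids)

    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids


def intent_candidates(points, num_detections):
    """Return one 3D centroid per detected object as a candidate intent."""
    filtered = filter_points(points)
    if num_detections < 1 or len(filtered) < num_detections:
        return np.empty((0, 3))
    return kmeans_centroids(filtered, num_detections)


# Synthetic scene: two small objects resting above a flat table at z = 0.
rng = np.random.default_rng(0)
obj_a = np.array([0.0, 0.0, 0.1]) + 0.01 * rng.standard_normal((50, 3))
obj_b = np.array([0.5, 0.2, 0.1]) + 0.01 * rng.standard_normal((50, 3))
table = np.column_stack([rng.uniform(-1, 1, 100), rng.uniform(-1, 1, 100), np.zeros(100)])
scene = np.vstack([obj_a, obj_b, table])
candidates = intent_candidates(scene, num_detections=2)
```

Taking $k$ from a detector avoids having to guess the number of clusters from the point cloud alone, which is the role the caption assigns to YOLOv5.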