Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration

Jennifer Grannen, Siddharth Karamcheti, Suvir Mirchandani, Percy Liang, Dorsa Sadigh

TL;DR

This work introduces Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments with lightweight and interpretable learning algorithms that allow users to build an understanding and co-adapt to a robot's capabilities in real-time, as they teach new behaviors.

Abstract

We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to adapt and continually learn at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding and co-adapt to a robot's capabilities in real-time, as they teach new behaviors. For example, after demonstrating a new low-level skill for "tracking around" an object, users are provided with trajectory visualizations of the robot's intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as "packing an object away" as compositions of low-level skills, concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines and yielding more complex autonomous performance (+19.7%) with fewer failures (-67.1%). Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+20.6%) and overall performance (+13.9%). Finally, we pair an experienced system-user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to shoot a 52-second (232-frame) movie.

Paper Structure

This paper contains 27 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Motivating Example. We present Vocal Sandbox, a framework for human-robot collaboration that enables robots to adapt and continually learn from situated interactions. In this example, a user arranges individual LEGO structures for each frame of a stop-motion film [Bottom], while a robot arm controls the camera. The user teaches the robot new behaviors through feedback modalities such as language and demonstrations [Left]. The robot learns online, scaling to more complex tasks as the collaboration continues [Right].
  • Figure 2: Planning & Teaching with Language Models. We use language models to parse user utterances to plans, i.e., executable programs with functions and arguments defined by an API [Left]. Given a successful parse, we visualize an interpretable trace of both the plan and robot's intended behavior on a custom GUI. In the case of failure, we elicit teaching feedback from users to synthesize new functions and arguments.
  • Figure 3: Collaborative Gift-Bag Assembly. In this example, a study participant verbally asks the robot to "pack the toy car in the gift bag," leveraging pack, a newly taught behavior, to minimize his time supervising [Left]. When the robot fails to localize the "car" in the image, the user corrects this by clicking on the interactive GUI, producing a keypoint label, teaching a new argument, and grounding the corresponding skill [Right].
  • Figure 4: User Study Quantitative Results. We report robot supervision time [Left], behavior complexity [Middle], and skill failures [Right] across assembling individual gift bags in our user study. Over time, users working with Vocal Sandbox (VS) systems teach more complex high-level behaviors, see fewer skill failures, and need to supervise the robot for shorter periods of time compared to baselines.
  • Figure 5: User Study Subjective Results and System GUI. We report qualitative user rankings for Vocal Sandbox (VS) and two baselines. With high significance ($p < 0.05$), we observe that VS outperforms the VS - (Low, High) baseline across all measures except predictability and trust [Left]. We also visualize the graphical user interface (GUI) displayed to users when interacting with the system [Right].
  • ...and 3 more figures
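
The Figure 2 caption describes plans as executable programs built from functions and arguments, and the abstract describes taught behaviors such as "packing an object away" as compositions of low-level skills. The sketch below is one way to picture that idea and is not the authors' implementation: the `SkillAPI` class, the `{obj}` placeholder, and the primitive names `pickup`, `goto`, and `release` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# One plan step: (function name, argument), e.g. ("pickup", "toy car").
Step = Tuple[str, str]


@dataclass
class SkillAPI:
    """Hypothetical registry of low-level skills and taught behaviors."""

    # Names of low-level skills the robot can already execute (assumed).
    primitives: Set[str] = field(default_factory=set)
    # Taught behaviors: name -> ordered list of (function, argument template).
    behaviors: Dict[str, List[Step]] = field(default_factory=dict)

    def teach(self, name: str, steps: List[Step]) -> None:
        """Register a new behavior as a composition of known functions."""
        unknown = [
            fn for fn, _ in steps
            if fn not in self.primitives and fn not in self.behaviors
        ]
        if unknown:
            raise KeyError(f"cannot compose unknown functions: {unknown}")
        self.behaviors[name] = list(steps)

    def expand(self, name: str, arg: str) -> List[Step]:
        """Flatten a call into a trace of primitive steps."""
        if name in self.primitives:
            return [(name, arg)]
        if name not in self.behaviors:
            # In an interactive system this is where teaching would be elicited.
            raise KeyError(f"unknown function: {name}")
        trace: List[Step] = []
        for fn, template in self.behaviors[name]:
            # "{obj}" in a template is replaced by the caller's argument.
            trace.extend(self.expand(fn, template.replace("{obj}", arg)))
        return trace


api = SkillAPI(primitives={"pickup", "goto", "release"})
api.teach("pack", [("pickup", "{obj}"), ("goto", "gift bag"), ("release", "")])
for step in api.expand("pack", "toy car"):
    print(step)
```

Because `expand` recurses through `behaviors`, a taught behavior can itself appear as a step in a later one, which mirrors the "reused and built upon" property. The flattened trace is also the kind of structure that could be shown to a user before execution.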