RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches

Priya Sundaresan, Quan Vuong, Jiayuan Gu, Peng Xu, Ted Xiao, Sean Kirmani, Tianhe Yu, Michael Stark, Ajinkya Jain, Karol Hausman, Dorsa Sadigh, Jeannette Bohg, Stefan Schaal

TL;DR

This work presents RT-Sketch, a goal-conditioned manipulation policy that takes a hand-drawn sketch of the desired scene as input and outputs actions. It shows that RT-Sketch can interpret and act upon sketches with varied levels of specificity.

Abstract

Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can also help a downstream policy to be spatially-aware and even go beyond images to disambiguate task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input, and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches. We evaluate this approach on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally we find that RT-Sketch is able to perform on a similar level to image or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch has the capacity to interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings. For supplementary material and videos, please refer to our website: http://rt-sketch.github.io.

Paper Structure

This paper contains 28 sections, 3 equations, 17 figures, and 2 tables.

Figures (17)

  • Figure 1: (Left) Qualitative rollouts comparing RT-Sketch, RT-1, and RT-Goal-Image, (right) highlighting RT-Sketch's robustness to (top) ambiguous language and (bottom) visual distractors.
  • Figure 2: Architecture of RT-Sketch allowing different kinds of visual input. RT-Sketch adopts the Transformer (Vaswani et al., 2017) architecture with EfficientNet (Tan and Le, 2019) tokenization at the input, and outputs bucketized actions.
  • Figure 3: Goal Alignment Results: Average Likert scores for different policies rating perceived semantic alignment (Q1) and spatial alignment (Q2) to a provided goal. For straightforward benchmark manipulation tasks, RT-Sketch performs comparably to, and in some cases better than, RT-1 and RT-Goal-Image on both metrics for 5 out of 6 skills (H1). RT-Sketch further exhibits the ability to handle sketches of different levels of detail (H2), while achieving better goal alignment than baselines when the visual scene is distracting (H3) or language would be ambiguous (H4). Error bars indicate standard error across labeler ratings.
  • Figure 4: Perceived Spatial Alignment for Sketches Drawn by Other Annotators (H2): Using line sketches drawn by 6 annotators who are not represented in the training dataset for RT-Sketch, we record policy rollouts with these sketches as input for the move near skill. The resulting rollouts are rated by 22 human evaluators, who provide Likert ratings measuring spatial alignment between the achieved goal state and the given sketch. RT-Sketch's performance on these new input sketches is on par with policy performance on our original sketches (OURS), with no significant drop-off between sketches drawn by different annotators.
  • Figure 5: ContourDrawing Dataset: We visualize 6 samples from the ContourDrawing Dataset of Li et al. (2019). For each image, 5 separate annotators provide an edge-aligned sketch of the scene by outlining on top of the original image. As depicted, annotators are encouraged to preserve the main contours of the scene, but background details and fine-grained geometric details are often omitted. Li et al. (2019) then train an image-to-sketch translation network $\mathcal{T}$ with a loss that encourages alignment with at least one of the given reference sketches (see the paper's equation labeled eq:LT).
  • ...and 12 more figures
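
The Figure 2 caption says that RT-Sketch outputs bucketized actions but gives no further detail here. As a minimal sketch of what bucketizing a continuous action could look like, the example below uses uniform binning. The bin count of 256, the action bounds, and the function names bucketize and unbucketize are all assumptions made for illustration; they are not taken from the text above.

```python
import math

NUM_BINS = 256  # assumed bin count; the caption does not state one


def bucketize(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action value to an integer bin in [0, num_bins - 1]."""
    bins = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the bounds
        scaled = (clipped - low) / (high - low)    # normalise to [0, 1]
        index = int(math.floor(scaled * num_bins))
        bins.append(min(index, num_bins - 1))      # value == high goes to the last bin
    return bins


def unbucketize(bins, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return [low + (index + 0.5) * width for index in bins]


# Hypothetical 3-dimensional action with assumed bounds [-1, 1].
tokens = bucketize([-1.0, 0.0, 1.0], low=-1.0, high=1.0)
recovered = unbucketize(tokens, low=-1.0, high=1.0)
```

Under these assumptions, decoding a bin index returns the bin centre, so the round-trip error for any in-range value is at most half a bin width, here (high - low) / (2 * NUM_BINS).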