STEER: Flexible Robotic Manipulation via Dense Language Grounding

Laura Smith, Alex Irpan, Montserrat Gonzalez Arenas, Sean Kirmani, Dmitry Kalashnikov, Dhruv Shah, Ted Xiao

TL;DR

This work presents STEER, a robot learning framework that bridges high-level, commonsense reasoning with precise, flexible low-level control by training language-grounded policies with dense annotation.

Abstract

The complexity of the real world demands robotic systems that can intelligently adapt to unseen situations. We present STEER, a robot learning framework that bridges high-level, commonsense reasoning with precise, flexible low-level control. Our approach translates complex situational awareness into actionable low-level behavior through training language-grounded policies with dense annotation. By structuring policy training around fundamental, modular manipulation skills expressed in natural language, STEER exposes an expressive interface for humans or Vision-Language Models (VLMs) to intelligently orchestrate the robot's behavior by reasoning about the task and context. Our experiments demonstrate the skills learned via STEER can be combined to synthesize novel behaviors to adapt to new situations or perform completely new tasks without additional data collection or training.

Paper Structure

This paper contains 13 sections, 6 figures.

Figures (6)

  • Figure 1: System diagram for STEER. At training time, we re-annotate an offline dataset of diverse robot behaviors, focusing on describing the primitive skills used to manipulate objects and, specifically, on how the robot performed each skill. We then use this re-annotated dataset to train a language-conditioned low-level policy (RT-1 in our case). At inference time, when given a complex instruction like "pick up the flower pot without disturbing the plant", a high-level system (VLM or human) identifies the appropriate low-level skills and determines how to perform them. This emphasis on the "how" enables more contextual behavior.
  • Figure 2: Anchor vectors and their semantic labels. Purple, green, and pink vectors represent side, top-down, and diagonal grasps.
  • Figure 3: Sample initial conditions for the new object-grasping scenarios evaluated: (top left) a kettle with a handle extending above it, (top right) a potted plant, (bottom) 2 of the 15 scenes for Fruit in Clutter. The kettle should be grasped from above. To avoid disturbing the plant, the flower pot should be grasped around its body. Lastly, the fruits should be grasped without knocking over the other objects in the scene.
  • Figure 4: Grasp steerability of OpenVLA, RT-1, and STEER. We test whether each model can be steered to grasp an object in different ways that would be appropriate for different, unseen tasks; e.g., in order to pour out of the Coke can, the robot should grasp the can around its body. When prompted to "Grasp the Coke can" from the top (top row) versus the side (bottom row), models without dense annotation show no perceptible change, while our densely labeled model adjusts its behavior, enabling new downstream tasks.
  • Figure 5: Results on grasping in unseen scenarios and performing a new task, with human or VLM guidance. We find that having access to, and being able to reason about, extracted low-level strategies enables higher success in out-of-distribution (OOD) scenarios than the baseline RT-1 model and a state-of-the-art VLA.
  • ...and 1 more figures
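
The dense re-annotation described in Figures 1 and 2 can be illustrated with a minimal sketch: a grasp's approach direction is matched to the nearest anchor vector, and that anchor's semantic label is appended to the instruction so that it also says how the skill was performed. Everything specific here is an assumption made for illustration, not the authors' implementation: the anchor coordinates, the coordinate frame, the label wording, and the helper names `label_grasp` and `dense_instruction` are all hypothetical.

```python
import math

# Hypothetical unit anchor vectors for the gripper approach direction, in a
# frame where +x points away from the robot and +z points up. The values and
# label strings are illustrative only; they are not taken from the paper.
ANCHORS = {
    "from the side": (1.0, 0.0, 0.0),
    "from the top": (0.0, 0.0, -1.0),
    "diagonally": (math.sqrt(0.5), 0.0, -math.sqrt(0.5)),
}


def cosine(u, v):
    """Cosine similarity between two 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot label a zero-length direction")
    return dot / (norm_u * norm_v)


def label_grasp(approach):
    """Return the semantic label of the anchor closest to `approach`."""
    return max(ANCHORS, key=lambda name: cosine(approach, ANCHORS[name]))


def dense_instruction(obj, approach):
    """Build an instruction that states how the grasp was performed."""
    return f"grasp the {obj} {label_grasp(approach)}"


if __name__ == "__main__":
    # A mostly horizontal approach direction observed in a logged episode.
    print(dense_instruction("coke can", (0.9, 0.1, -0.1)))
```

At inference time, a human or VLM would choose among the same label strings, which is what lets the high-level system steer how a grasp is executed rather than only which object is grasped.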