Accurately and Efficiently Interpreting Human-Robot Instructions of Varying Granularities

Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Lawson L. S. Wong, Stefanie Tellex

TL;DR

This work grounds natural language commands to robot tasks at multiple levels of abstraction by combining a deep neural language grounding component with a top-down Abstract Markov Decision Process (AMDP) hierarchy. It introduces lifted reward functions and a grounding module that binds them to environment-specific objects, enabling fast, accurate planning. The stated contributions are language grounding across multiple levels of abstraction (presented as the first of its kind), higher grounding accuracy with Single-RNN and related architectures, and substantial planning speedups demonstrated both in simulation and on a Turtlebot. Together these make command interpretation faster and more accurate across a wide range of command granularities.

Abstract

Humans can ground natural language commands to tasks at both abstract and fine-grained levels of specificity. For instance, a human forklift operator can be instructed to perform a high-level action, like "grab a pallet," or a low-level action, like "tilt back a little bit." While robots are also capable of grounding language commands to tasks, previous methods implicitly assume that all commands and tasks reside at a single, fixed level of abstraction. Additionally, methods that do not use multiple levels of abstraction encounter inefficient planning and execution times, as they solve tasks at a single level of abstraction with large, intractable state-action spaces closely resembling real-world complexity. In this work, by grounding commands to all the tasks or subtasks available in a hierarchical planning framework, we arrive at a model capable of interpreting language at multiple levels of specificity, ranging from coarse to more granular. We show that the accuracy of the grounding procedure is improved when simultaneously inferring the degree of abstraction in the language used to communicate the task. Leveraging hierarchy also improves efficiency: our proposed approach enables a robot to respond to a command within one second on 90% of our tasks, while baselines take over twenty seconds on half the tasks. Finally, we demonstrate that a real, physical robot can ground commands at multiple levels of abstraction, allowing it to efficiently plan different subtasks within the same planning hierarchy.
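
To make the idea of a lifted reward function and its grounding more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a lifted reward function can be represented as a hierarchy level plus a predicate over symbolic placeholders, and that grounding means substituting environment-specific objects for those placeholders. Every name here (`LiftedReward`, `GroundedReward`, `ground`, `is_satisfied`, the `agent_in_room` predicate, the `?room` placeholder, and the `red_room` object) is hypothetical and chosen only for illustration. The language model that would predict the level and the lifted reward function from a command is not shown.

```python
from dataclasses import dataclass

# All names below are hypothetical illustrations, not taken from the paper's code.


@dataclass(frozen=True)
class LiftedReward:
    """A reward template: a hierarchy level plus a predicate over symbolic placeholders."""
    level: int         # assumed convention: 0 = most fine-grained, larger = more abstract
    predicate: str     # e.g. "agent_in_room"
    params: tuple      # symbolic placeholders, e.g. ("?room",)


@dataclass(frozen=True)
class GroundedReward:
    """The same template with placeholders replaced by concrete environment objects."""
    level: int
    predicate: str
    args: tuple


def ground(lifted, bindings):
    """Substitute an environment-specific object for each symbolic placeholder."""
    missing = [p for p in lifted.params if p not in bindings]
    if missing:
        raise KeyError(f"no binding for {missing}")
    args = tuple(bindings[p] for p in lifted.params)
    return GroundedReward(lifted.level, lifted.predicate, args)


def is_satisfied(reward, state):
    """Goal test. `state` maps a predicate name to the set of argument tuples that hold."""
    return reward.args in state.get(reward.predicate, set())


# A coarse command such as "go to the red room" might map to this lifted form
# (here written by hand; in the paper a learned model performs this step).
lifted = LiftedReward(level=2, predicate="agent_in_room", params=("?room",))
grounded = ground(lifted, {"?room": "red_room"})

state = {"agent_in_room": {("red_room",)}}
print(grounded, is_satisfied(grounded, state))
```

Keeping the level alongside the predicate reflects the abstract's point that the degree of abstraction is inferred together with the task, so a planner could be invoked at the matching level of the hierarchy rather than always at the lowest one.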

Paper Structure

This paper contains 19 sections, 6 equations, 6 figures.

Figures (6)

  • Figure 1: Examples of high-level and fine-grained commands issued to the Turtlebot robot in a mobile-manipulation task.
  • Figure 2: Model architectures for all three sets of deep neural network models. In blue are the network inputs, and in red are the network outputs. Going left to right, the green denotes significant structural differences between models.
  • Figure 3: Amazon Mechanical Turk (AMT) dataset domain and examples.
  • Figure 4: Task grounding accuracy (averaged over 5 trials) when training IBM2 and Single-RNN models on a single level of abstraction, then evaluating commands from alternate levels. This is similar to the results of MacGlashan et al. (2015): without accounting for abstractions in language, there is a noticeable effect on grounding accuracy.
  • Figure 5: Accuracy of 10-Fold Cross Validation (averaged over 3 runs) for each of the models on the AMT Dataset.
  • ...and 1 more figure