Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following

Valts Blukis, Ross A. Knepper, Yoav Artzi

TL;DR

This work tackles instruction following for robots in environments that contain objects not seen during training. It introduces a few-shot language-conditioned segmentation method backed by an extensible object database, and uses its output to build an allocentric object-context grounding map that encodes how each object is to be used according to the instruction. A two-stage policy (visitation distribution prediction followed by velocity control) consumes the object-context map and segmentation outputs, so the policy generalizes to unseen objects by simply adding exemplars; deployment to real robots is achieved by swapping the grounding module without retraining. Evaluated on a physical quadcopter and in simulation, the proposed FsPVN method substantially outperforms prior state-of-the-art baselines, especially in unseen-object scenarios, and ablations confirm the value of the grounding maps and the SuReAL training framework. The approach enables scalable, interpretable language-grounded robotics with easy domain transfer and minimal data requirements.

Abstract

We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test-time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training.

Paper Structure

This paper contains 45 sections, 12 equations, 16 figures, 4 tables, 2 algorithms.

Figures (16)

  • Figure 1: Task and approach illustration, including a third-person view of the environment (unavailable to the agent), an agent's first-person RGB observation, a natural language instruction, and an object database. The agent's reasoning can be extended by adding entries to the database.
  • Figure 2: Few-shot language-conditioned segmentation illustration. Alignment scores are computed by comparing the visual similarity of database images to proposed bounding boxes and the textual similarity of database phrases with object references (e.g., the noisy "the planter turn"). The aligned bounding boxes are refined to create segmentation masks for each mentioned object.
  • Figure 3: Policy architecture illustration. The first stage uses our few-shot language-conditioned segmentation to identify mentioned objects in the image. The segmentation and instruction embedding are used to generate an allocentric object context grounding map $\mathbf{C}^{W}_t$, a learned map of the environment that encodes at every position the behavior to be performed at or near it. We use $\textsc{LingUNet}$ to predict visitation distributions, which the second stage maps to velocity commands. The components in blue are adopted from prior work \cite{blukis2018following,blukis2019learning}, while we add the components in green to enable few-shot generalization. Appendix \ref{app:model} includes a whole-page version of this figure.
  • Figure 4: Human evaluation results on the physical quadcopter in environments with only new objects. We plot the Likert scores using Gantt charts of score frequencies, with mean scores in black.
  • Figure 5: Visualization of the model reasoning when executing the instruction "go straight and stop before reaching the planter turn left towards the globe and go forward until just before it". The extracted object references are highlighted in the instruction in blue, and other noun chunks in red. The probability $\hat{P}(o | r)$ that aligns each object reference with an object in the database is visualized in the top-left pane. An overhead view of the quadcopter trajectory, visualized over a simulated image of the environment layout, is given in the top-right pane. For timesteps 0 (left) and 27 (right), we show the first-person image $I_t$ observed at timestep $t$, the probability $\hat{P}(o|b)$ that aligns each proposed image region $b \in \mathcal{B}$ with an object in the database, the alignment score $\textsc{Align}(b,r)$ between image regions and object references computed from Equation \ref{eq:grounding-inference}, the resulting first-person segmentation masks $S(I,r,\mathcal{O})$, the projected object masks $M^W(I,r,\mathcal{O})$ obtained by projecting $S(I,r,\mathcal{O})$ into an allocentric reference frame, and the predicted visitation distributions $d^{p}$ (red) and $d^{g}$ (green).
  • ...and 11 more figures
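
The captions for Figures 2 and 5 describe aligning image regions with object references through a shared object database, using $\hat{P}(o|b)$ and $\hat{P}(o|r)$. The following is a minimal sketch of one simple way such quantities could be combined, assuming the score is a sum over database objects of the product of the two probabilities. That combination rule is an assumption made for illustration and is not necessarily the paper's grounding-inference equation. The function names and the toy probabilities are hypothetical.

```python
# Illustrative sketch only: the combination rule below (sum over database
# objects of P(o|b) * P(o|r)) is an assumption, not the paper's exact equation.

def align_scores(p_obj_given_box, p_obj_given_ref):
    """Return scores[b][r] = sum over objects o of P(o|b) * P(o|r).

    p_obj_given_box: one distribution over database objects per proposed box.
    p_obj_given_ref: one distribution over database objects per object reference.
    """
    scores = []
    for box_dist in p_obj_given_box:
        row = []
        for ref_dist in p_obj_given_ref:
            row.append(sum(pb * pr for pb, pr in zip(box_dist, ref_dist)))
        scores.append(row)
    return scores


def best_box_per_reference(scores):
    """For each reference (column), return the index of the highest-scoring box."""
    n_refs = len(scores[0])
    return [
        max(range(len(scores)), key=lambda b: scores[b][r])
        for r in range(n_refs)
    ]


# Toy database with two objects: index 0 = "planter", index 1 = "globe".
# Hypothetical P(o|b) for three proposed bounding boxes.
p_obj_given_box = [
    [0.9, 0.1],  # box 0 looks like the planter exemplars
    [0.2, 0.8],  # box 1 looks like the globe exemplars
    [0.5, 0.5],  # box 2 is ambiguous
]
# Hypothetical P(o|r) for two object references extracted from the instruction.
p_obj_given_ref = [
    [0.95, 0.05],  # "the planter"
    [0.10, 0.90],  # "the globe"
]

scores = align_scores(p_obj_given_box, p_obj_given_ref)
matches = best_box_per_reference(scores)
print(matches)
```

Under this sketch, extending the system to a new object only adds one entry to every distribution over database objects; the scoring code itself is unchanged, which mirrors the caption's point that reasoning is extended by adding database entries.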