Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following
Valts Blukis, Ross A. Knepper, Yoav Artzi
TL;DR
This work tackles the challenge of instruction-following for robots in environments containing objects not seen during training. It introduces a few-shot language-conditioned segmentation method grounded in an extensible object database and constructs an allocentric object-context grounding map that encodes how each object is used in the instruction. A two-stage policy (visitation distribution prediction followed by velocity control) takes the object-context map and segmentation outputs as input, so the policy generalizes to unseen objects simply by adding exemplars; deployment to real robots is achieved by swapping the grounding module without retraining. Evaluated on a physical quadcopter and in simulation, the proposed FsPVN method substantially outperforms prior state-of-the-art baselines, especially in unseen-object scenarios, and ablations confirm the value of the grounding maps and the SuReAL training framework. The approach has practical impact: it enables scalable, interpretable language-grounded robotics with easy domain transfer and minimal data requirements.
Abstract
We study the problem of learning a robot policy that follows natural language instructions and can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training.
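The idea of extending a system to new objects by adding exemplars, rather than retraining, can be illustrated with a minimal sketch. This is not the paper's grounding model, which learns similarity between image regions, exemplars, and instruction text; it is a plain nearest-exemplar lookup over hand-written toy embeddings. The class name ExemplarDatabase, the method names, the similarity threshold, and the three-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExemplarDatabase:
    """Hypothetical extensible object database: name -> exemplar embeddings."""

    def __init__(self):
        self._exemplars = {}

    def add(self, name, embedding):
        # Adding an object only stores a new exemplar; nothing is retrained.
        self._exemplars.setdefault(name, []).append(list(embedding))

    def ground(self, query, threshold=0.5):
        """Return the best-matching object name, or None if no exemplar
        is more similar to the query than the (assumed) threshold."""
        best_name, best_score = None, threshold
        for name, embeddings in self._exemplars.items():
            score = max(cosine(query, e) for e in embeddings)
            if score > best_score:
                best_name, best_score = name, score
        return best_name


# Toy embeddings stand in for features of observed image regions.
db = ExemplarDatabase()
db.add("mushroom", [1.0, 0.0, 0.0])
db.add("boat", [0.0, 1.0, 0.0])

query = [0.1, 0.1, 0.9]          # region showing an object not yet in the database
before = db.ground(query)        # no exemplar is similar enough

db.add("traffic cone", [0.0, 0.0, 1.0])  # extend the database at test time
after = db.ground(query)         # the same query now matches the new exemplar
```

In this sketch the matching function stays fixed while the database grows, which mirrors the property the abstract describes: new objects are handled by supplying exemplars at test time.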
