Learning Instruction-Guided Manipulation Affordance via Large Models for Embodied Robotic Tasks

Dayou Li, Chenkun Zhao, Shuo Yang, Lin Ma, Yibin Li, Wei Zhang

TL;DR

The paper tackles instruction-guided robotic manipulation by predicting pixel-level manipulation affordances conditioned on natural language. It introduces IGANet, a model that fuses frozen vision (OWL-ViT) and language (Universal-Sentence-Encoder) encodings to produce an affordance map $M$, trained with cross-entropy loss $L$. A Vision-Language Model (VLM)-driven data augmentation pipeline using Inpaint-Anything and GPT-4V, plus an LLM-based planner, scales data and translates affordances into grasping $a_g=(p_g, \theta_g)$ and pushing $a_p=(p_p, d_p)$ actions. Experiments across six real-world scenes show strong generalization to unseen objects and instructions, validating both data augmentation and the IGANet framework for practical instruction-guided manipulation.
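To make the fusion and loss concrete, here is a minimal pure-Python sketch of a Hadamard-product fusion followed by a per-pixel cross-entropy. It is not the authors' implementation. It assumes the visual and language features have already been projected to a common dimension `C`, that a hypothetical linear head (`linear_head`) maps each fused pixel feature to a single logit, and that the cross-entropy is binary and averaged over pixels; the summary does not specify any of these details.

```python
import math

def hadamard_fuse(visual, text):
    """Multiply every pixel's C-dim feature element-wise by the C-dim text vector."""
    return [[[v * t for v, t in zip(pixel, text)] for pixel in row]
            for row in visual]

def linear_head(fused, weights, bias=0.0):
    """Hypothetical 1x1 projection (an assumption): one affordance logit per pixel."""
    return [[sum(f * w for f, w in zip(pixel, weights)) + bias for pixel in row]
            for row in fused]

def pixel_cross_entropy(logits, target):
    """Mean binary cross-entropy over pixels, in a numerically stable form."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target):
        for x, y in zip(logit_row, target_row):
            # Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x).
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count

# Toy input: a 1 x 2 feature grid with C = 2 channels (stand-in for encoder output).
visual = [[[1.0, 2.0], [0.5, -1.0]]]
text = [2.0, 0.5]                         # stand-in for the sentence embedding
fused = hadamard_fuse(visual, text)
logits = linear_head(fused, [1.0, 1.0])
loss = pixel_cross_entropy(logits, [[1, 0]])  # binary ground-truth affordance mask
```

The element-wise product lets the instruction rescale individual visual channels, which is one way the same object can yield different affordance maps under different instructions.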

Abstract

We study the task of language instruction-guided robotic manipulation, in which an embodied robot is supposed to manipulate target objects based on language instructions. In previous studies, the predicted manipulation regions of the target object typically do not change with the language instruction, which means that language perception and manipulation prediction are separate. In human behavior, however, the manipulation regions of the same object change with different language instructions. In this paper, we propose Instruction-Guided Affordance Net (IGANet) for predicting affordance maps in instruction-guided robotic manipulation tasks by utilizing powerful priors from vision and language encoders pre-trained on large-scale datasets. We develop a Vision-Language-Models (VLMs)-based data augmentation pipeline, which can automatically generate a large amount of data for model training. In addition, with the help of Large-Language-Models (LLMs), actions can be effectively executed to finish the tasks defined by the instructions. A series of real-world experiments shows that our method achieves better performance with generated data. Moreover, our model generalizes better to scenarios with unseen objects and language instructions.

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Illustration of the human reasoning process in handling instruction-guided manipulation tasks.
  • Figure 2: Illustration of the presented full pipeline for instruction-guided manipulation tasks. Our pre-labeled dataset will be scaled up via our data augmentation pipeline. Then the IGANet is trained on the generated dataset to predict affordance maps based on the language instructions. Finally, the LLM-based planner will give commands on action execution based on the affordance maps and instructions.
  • Figure 3: Data Generation Result. Our proposed data augmentation pipeline uses GPT-4 as the LLM to generate prompts, and the Inpaint-Anything module edits the image according to them.
  • Figure 4: Structure of IGANet. IGANet uses a frozen OWL vision encoder to encode the RGB input, and the language instruction is encoded by a frozen Universal-Sentence encoder. The RGB feature and the language feature are fused by a Hadamard product. The final output of IGANet is the affordance map of dense pixel-wise features.
  • Figure 5: LLM-Based Planner. The LLM-based planner uses GPT-4 as the LLM to make action decisions based on our prompt engineering.
  • ...and 2 more figures
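
The last stage of the pipeline in Figure 2, turning an affordance map into a grasp $a_g=(p_g, \theta_g)$ or a push $a_p=(p_p, d_p)$, can be illustrated with a small sketch. This is an assumption-laden example and not the paper's planner: the class names, the `to_action` helper, and the choice of the peak-affordance pixel as the contact point are all hypothetical, $d_p$ is assumed to be a 2D push direction, and the action type, angle, and direction are taken as given inputs (in the paper these decisions come from the LLM-based planner).

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GraspAction:
    position: Tuple[int, int]        # p_g as a (row, col) pixel
    angle_rad: float                 # theta_g

@dataclass
class PushAction:
    position: Tuple[int, int]        # p_p as a (row, col) pixel
    direction: Tuple[float, float]   # d_p, assumed here to be a unit vector

def peak_pixel(affordance: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) of the highest affordance value (first one on ties)."""
    best = (0, 0)
    for r, row in enumerate(affordance):
        for c, value in enumerate(row):
            if value > affordance[best[0]][best[1]]:
                best = (r, c)
    return best

def to_action(affordance, kind, angle_rad=0.0, direction=(1.0, 0.0)):
    """Build an action at the peak pixel; `kind` stands in for the planner's decision."""
    point = peak_pixel(affordance)
    if kind == "grasp":
        return GraspAction(point, angle_rad)
    if kind == "push":
        norm = math.hypot(direction[0], direction[1])
        if norm == 0.0:
            raise ValueError("push direction must be non-zero")
        return PushAction(point, (direction[0] / norm, direction[1] / norm))
    raise ValueError(f"unknown action kind: {kind!r}")

affordance_map = [[0.1, 0.2],
                  [0.9, 0.3]]
grasp = to_action(affordance_map, "grasp", angle_rad=1.57)
push = to_action(affordance_map, "push", direction=(3.0, 4.0))
```

A real system would additionally map the pixel to a 3D point using depth and camera calibration, which this sketch leaves out.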