Learning Instruction-Guided Manipulation Affordance via Large Models for Embodied Robotic Tasks

Dayou Li; Chenkun Zhao; Shuo Yang; Lin Ma; Yibin Li; Wei Zhang

TL;DR

The paper tackles instruction-guided robotic manipulation by predicting pixel-level manipulation affordances conditioned on natural language. It introduces IGANet, a model that fuses frozen vision (OWL-ViT) and language (Universal-Sentence-Encoder) encodings to produce an affordance map $M$, trained with cross-entropy loss $L$. A Vision-Language Model (VLM)–driven data augmentation pipeline using Inpaint-Anything and GPT-4V, plus an LLM-based planner, scales data and translates affordances into grasping $a_g=(p_g, \theta_g)$ and pushing $a_p=(p_p, d_p)$ actions. Experiments across six real-world scenes show strong generalization to unseen objects and instructions, validating both data augmentation and the IGANet framework for practical instruction-guided manipulation.
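To make the affordance map $M$ and its cross-entropy loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes a per-pixel binary cross-entropy against a 0/1 label map, and it assumes that a grasp point can be read off as the highest-scoring pixel; the summary does not specify either detail. The names `pixelwise_cross_entropy` and `argmax_pixel` are hypothetical.

```python
import math


def pixelwise_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two H x W maps given as nested lists.

    Assumption: the paper's loss may be defined differently; this is one common form.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def argmax_pixel(affordance):
    """Return the (row, col) of the highest-scoring pixel (hypothetical p_g choice)."""
    coords = ((r, c) for r, row in enumerate(affordance) for c in range(len(row)))
    return max(coords, key=lambda rc: affordance[rc[0]][rc[1]])


# Toy 2 x 2 example: predicted affordance map M and a binary label map Y.
M = [[0.1, 0.2], [0.9, 0.3]]
Y = [[0, 0], [1, 0]]

loss = pixelwise_cross_entropy(M, Y)
p_g = argmax_pixel(M)
print(f"loss={loss:.4f}, p_g={p_g}")
```

The sketch leaves out the grasp angle $\theta_g$ and the push direction $d_p$, because the summary does not say how they are derived from the map.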

Abstract

We study the task of language instruction-guided robotic manipulation, in which an embodied robot is supposed to manipulate the target objects based on the language instructions. In previous studies, the predicted manipulation regions of the target object typically do not change with the specification given in the language instructions, which means that language perception and manipulation prediction are separate. However, in human behavioral patterns, the manipulation regions of the same object change for different language instructions. In this paper, we propose Instruction-Guided Affordance Net (IGANet) for predicting affordance maps of instruction-guided robotic manipulation tasks by utilizing powerful priors from vision and language encoders pre-trained on large-scale datasets. We develop a Vision-Language Model (VLM)-based data augmentation pipeline, which can automatically generate a large amount of data for model training. In addition, with the help of Large Language Models (LLMs), actions can be effectively executed to finish the tasks defined by the instructions. A series of real-world experiments revealed that our method can achieve better performance with generated data. Moreover, our model can generalize better to scenarios with unseen objects and language instructions.

Paper Structure (14 sections, 1 equation, 7 figures, 1 table)

Introduction
Related Works
VLMs-Driven Data Augmentation
Language-Guided Robotic Manipulation
Method
Pipeline Overview
Scaling up Data via VLMs
Learning Instruction-Guided Manipulation Affordance
Action Execution
Experiments
Environment Setup
Baseline Methods
Results
Conclusions

Figures (7)

Figure 1: Illustration of human's reasoning process in handling instruction-guided manipulation tasks.
Figure 2: Illustration of the presented full pipeline for instruction-guided manipulation tasks. Our pre-labeled dataset will be scaled up via our data augmentation pipeline. Then the IGANet is trained on the generated dataset to predict affordance maps based on the language instructions. Finally, the LLM-based planner will give commands on action execution based on the affordance maps and instructions.
Figure 3: Data Generation Result. Our proposed data augmentation pipeline uses GPT-4 as an LLM to generate prompts for the Inpaint-Anything module to edit the image according to the generated prompts.
Figure 4: Structure of IGANet. IGANet uses a frozen OWL vision encoder to encode the RGB input, and the language instruction is encoded by a frozen Universal-Sentence encoder. The RGB feature and the language feature are fused with a Hadamard product. The final output of IGANet is the affordance map of dense pixel-wise features.
Figure 5: LLM-Based Planner. The LLM-based planner uses GPT-4 as the LLM to make action decisions based on our prompt engineering.
...and 2 more figures
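
The Hadamard-product fusion described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the IGANet code: random arrays stand in for the frozen OWL-ViT and Universal-Sentence-Encoder outputs, both features are assumed to already share the same channel width, and the linear-plus-sigmoid head (`affordance_head`) is a hypothetical decoder chosen only to yield a per-pixel score.

```python
import numpy as np


def fuse_hadamard(visual_feat, lang_feat):
    """Element-wise product of an (H, W, C) feature map with a (C,) language vector."""
    if visual_feat.shape[-1] != lang_feat.shape[0]:
        raise ValueError("channel dimensions must match")
    return visual_feat * lang_feat  # the language vector broadcasts over H and W


def affordance_head(fused, weights, bias=0.0):
    """Hypothetical head: per-pixel linear projection followed by a sigmoid."""
    logits = fused @ weights + bias  # shape (H, W)
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 4, 8))  # stand-in for frozen vision features
language = rng.standard_normal(8)        # stand-in for the frozen sentence embedding
w = rng.standard_normal(8)               # stand-in for learned head weights

fused = fuse_hadamard(visual, language)
affordance = affordance_head(fused, w)
print(fused.shape, affordance.shape)
```

Because the same language vector scales every spatial location, a different instruction changes which channels are emphasized at each pixel, which is one way the predicted region can depend on the instruction for the same image.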
