
NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation

Ran Xu, Yan Shen, Xiaoqi Li, Ruihai Wu, Hao Dong

TL;DR

This work introduces a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks, containing over 4500 episodes meticulously annotated with fine-grained language instructions, and proposes a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.

Abstract

Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic, task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without human instructions they become extremely difficult for a robot to carry out. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and containing over 4500 episodes meticulously annotated with fine-grained language instructions. We split the long-term task process into several steps, each with its own natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines on the proposed NrVLM benchmark. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
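
The step-by-step control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the three functions are toy stand-ins for the learned components, every function name is hypothetical, and the three fine-grained instructions are invented for illustration rather than taken from the NrVLM annotations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Step:
    instruction: str     # fine-grained instruction chosen for this step
    contact_point: Vec3  # where the gripper should touch the object
    gripper_pose: Vec3   # simplified here to an approach direction


def select_instruction(instructions: List[str], steps_done: int) -> Optional[int]:
    # Toy stand-in: the paper's selector conditions on visual observations
    # and the end-effector state; here we only count completed steps.
    return steps_done if steps_done < len(instructions) else None


def predict_contact_point(points: List[Vec3], instruction: str) -> Vec3:
    # Toy stand-in for an affordance predictor: pick the highest point.
    return max(points, key=lambda p: p[2])


def predict_gripper_pose(contact: Vec3, instruction: str) -> Vec3:
    # Toy stand-in for an actor: always approach along -z.
    return (0.0, 0.0, -1.0)


def run_episode(instructions: List[str], points: List[Vec3]) -> List[Step]:
    executed: List[Step] = []
    while True:
        idx = select_instruction(instructions, len(executed))
        if idx is None:  # no instruction left to execute
            break
        instr = instructions[idx]
        contact = predict_contact_point(points, instr)
        pose = predict_gripper_pose(contact, instr)
        executed.append(Step(instr, contact, pose))
    return executed


# Hypothetical fine-grained steps for "Slide the top drawer open".
INSTRUCTIONS = [
    "Move the gripper to the handle of the top drawer",
    "Grasp the handle",
    "Pull the handle outward",
]
POINTS = [(0.0, 0.0, 0.5), (0.1, 0.2, 0.9), (0.3, 0.1, 0.7)]

for step in run_episode(INSTRUCTIONS, POINTS):
    print(step.instruction, step.contact_point, step.gripper_pose)
```

The point of the sketch is the ordering: an instruction is selected first, and the contact point and end-effector pose are then predicted conditioned on that instruction.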

Paper Structure

This paper contains 24 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration on the fine-grained instructions. The leftmost and rightmost pairs represent action-prompt and perception-prompt bases respectively. In the center column are the manipulation steps for "Slide the top drawer open," each accompanied by fine-grained language instructions. If the current task's manipulation step shares the same action or noun phrase as another task's manipulation step in the fine-grained language instruction, cross-modal alignment will be conducted using the features of the action-prompt base and the perception-prompt base.
  • Figure 2: We introduce NrVLM, a comprehensive benchmark comprising multiple manipulation tasks annotated with fine-grained natural language instructions. Visualizations of selected tasks from the benchmark are presented in the top two rows. Additionally, we introduce different task variations to enrich the diversity and complexity of the benchmark, as demonstrated in the bottom two rows.
  • Figure 3: The overall framework. The bottom part shows the manipulation process, where the Instruction Selection network (InstrSel) selects the appropriate fine-grained language instruction, the Affordance network (AFF-NET) predicts the object-centric affordance map, and the Actor network (ACT-NET) predicts the gripper action. The top part shows the alternative perception-prompt and action-prompt modules, which enhance the Affordance and Actor networks by aligning the noun-related perception-prompt set and the verb-related action-prompt set. The two dotted arrows before the Affordance and Actor networks indicate that the prompt modules are optional. The entire method is trained in an end-to-end manner.
  • Figure 4: Instruction generation process for two large models (MiniGPT-4 on the left). The red text in the green box is the important element missed by the large models, "highlevelsentence" is the high-level instruction of the current task, and "stepslength" is the total number of steps.
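
The matching rule in the Figure 1 caption (a step is aligned with steps from other tasks that share its action or its noun phrase) can be illustrated with a small lookup sketch. It rests on several assumptions: each step is taken to arrive already tagged with its action and noun phrase, all class and function names are hypothetical, and the "Open the door" task and every step instruction are invented for illustration. The feature-level alignment itself is not shown.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class AnnotatedStep(NamedTuple):
    task: str         # high-level task the step belongs to
    instruction: str  # fine-grained instruction for the step
    action: str       # verb, assumed to be pre-extracted
    noun_phrase: str  # object part, assumed to be pre-extracted


Base = Dict[str, List[AnnotatedStep]]


def build_prompt_bases(steps: List[AnnotatedStep]) -> Tuple[Base, Base]:
    # Index steps by verb (action-prompt base) and by noun phrase
    # (perception-prompt base).
    action_base: Base = defaultdict(list)
    perception_base: Base = defaultdict(list)
    for s in steps:
        action_base[s.action].append(s)
        perception_base[s.noun_phrase].append(s)
    return action_base, perception_base


def alignment_partners(current: AnnotatedStep, action_base: Base,
                       perception_base: Base):
    # Steps from OTHER tasks sharing the same action or noun phrase.
    same_action = [s for s in action_base.get(current.action, [])
                   if s.task != current.task]
    same_noun = [s for s in perception_base.get(current.noun_phrase, [])
                 if s.task != current.task]
    return same_action, same_noun


# Hypothetical annotations for illustration only.
corpus = [
    AnnotatedStep("Open the door", "Grasp the door handle", "grasp", "handle"),
    AnnotatedStep("Open the door", "Pull the door open", "pull", "door"),
    AnnotatedStep("Slide the top drawer open", "Grasp the drawer handle",
                  "grasp", "handle"),
]
current = AnnotatedStep("Slide the top drawer open", "Pull the handle outward",
                        "pull", "handle")

action_base, perception_base = build_prompt_bases(corpus)
same_action, same_noun = alignment_partners(current, action_base, perception_base)
print([s.instruction for s in same_action])
print([s.instruction for s in same_noun])
```

In this toy corpus the current step shares its verb with one door step and its noun phrase with another, so each prompt base contributes one partner from a different task.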