AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jeremy Liu, Ruiping Wang, Hao Dong

TL;DR

The paper proposes an Autonomous Interactive Correction (AIC) MLLM that uses previous low-level interaction experiences to correct SE(3) pose predictions for articulated objects. The model is first fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities.

Abstract

The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, and make limited use of failed samples to correct low-level contact poses, even though such failures are particularly likely during articulated object manipulation. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions for articulated objects. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities. We design two types of prompt instructions for interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts. Our project website is https://sites.google.com/view/aic-mllm.
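
The inference-time behaviour described in the abstract (predict a pose, attempt the interaction, extract feedback on failure, then re-predict with that feedback as a prompt) can be sketched as a small control loop. This is only an illustrative sketch and not the authors' implementation: the names `Feedback`, `correction_loop`, `predict`, `execute`, `extract_feedback`, and `max_corrections` are all hypothetical, the pose is a placeholder tuple rather than a real SE(3) representation, and the Test Time Adaptation step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Placeholder for an SE(3) pose; a real system would carry position and rotation.
Pose = Tuple[float, ...]


@dataclass
class Feedback:
    """Hypothetical record of what one failed attempt revealed."""
    kind: str    # "position" (visual mask prompt) or "rotation" (text prompt)
    detail: str  # e.g. a mask identifier or a textual direction hint


def correction_loop(
    predict: Callable[[List[Feedback]], Pose],
    execute: Callable[[Pose], bool],
    extract_feedback: Callable[[Pose], Feedback],
    max_corrections: int = 3,
) -> Tuple[Optional[Pose], int]:
    """Attempt a pose; on failure, collect feedback and predict again.

    Returns (successful pose or None, number of corrections used).
    """
    feedback: List[Feedback] = []
    for n_corrections in range(max_corrections + 1):
        # The model sees all feedback gathered from earlier failures.
        pose = predict(feedback)
        if execute(pose):
            return pose, n_corrections
        # Stand-in for the Feedback Information Extraction step.
        feedback.append(extract_feedback(pose))
    return None, max_corrections


# Toy stand-ins: the "model" improves by one unit per piece of feedback,
# and the interaction succeeds once the first pose component reaches 2.
def toy_predict(feedback: List[Feedback]) -> Pose:
    return (float(len(feedback)),)


def toy_execute(pose: Pose) -> bool:
    return pose[0] >= 2.0


def toy_extract(pose: Pose) -> Feedback:
    return Feedback(kind="position", detail=f"failed at {pose[0]}")


result_pose, corrections_used = correction_loop(toy_predict, toy_execute, toy_extract)
```

Passing the model, executor, and feedback extractor as callables keeps the loop independent of any particular MLLM or simulator, which is a design choice of this sketch rather than something stated in the paper.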

Paper Structure (31 sections, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Correction process of AIC MLLM. Given a failed interaction, we first extract feedback information regarding the object's geometry, then enable the model to reflect and correct both position and rotation estimation, thereby generating a more accurate SE(3) pose that is then executed by the robot. We use a right-handed coordinate system to show the end effector's rotation. In the rightmost image, the yellow line represents the x-axis, the red line represents the y-axis, and the red line represents the z-axis.
  • Figure 2: Training of AIC MLLM. We gradually enable the model to predict poses and comprehend both visual and textual feedback prompts including object parts and axis information.
  • Figure 3: Testing of AIC MLLM. If an interaction fails, an FIE module is used to extract feedback information from the previous failed attempts. This feedback information is integrated into visual and linguistic prompts, which are then fed into the trained model, enabling it to reflect, correct, and generate new action predictions. After inference on each test sample, the model undergoes parameter updates in the TTA module to enhance generalization to the current testing configuration.
  • Figure 4: Ablation of the proposed method and visualization. The image on the left illustrates the relationship between success rate and the number of corrections. On the right, the correction process is depicted with the aid of simulation.
  • Figure 5: Examples of position correction. The first image shows the original prediction of the contact point. The red dot in the second image is a new contact point prediction that keeps away from the red mask. The third image is the interaction map, where the white area is the movable part of the object.
  • ...and 4 more figures
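
Figure 1 describes the end-effector rotation with a right-handed coordinate system. As a rough illustration of what that convention implies for an SE(3) pose, the sketch below builds a 4x4 homogeneous transform from an x-axis, an approximate y-axis, and a position, with the z-axis obtained as x × y. The function names (`se3_from_axes`, `y_hint`, and the helpers) are hypothetical and chosen for this example only; the paper's actual pose parameterization is not given in this section.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [vi / norm for vi in v]


def se3_from_axes(
    x_axis: Sequence[float],
    y_hint: Sequence[float],
    position: Sequence[float],
) -> List[List[float]]:
    """Return a 4x4 homogeneous transform with a right-handed rotation part.

    x_axis is kept (after normalization); y_hint is made orthogonal to it
    with one Gram-Schmidt step; z is x cross y, which makes the frame
    right-handed. Raises ValueError if y_hint is parallel to x_axis.
    """
    x = normalize(x_axis)
    along_x = dot(y_hint, x)
    y = normalize([yh - along_x * xi for yh, xi in zip(y_hint, x)])
    z = cross(x, y)
    # The rotation columns are the frame axes; the last column is the translation.
    return [
        [x[0], y[0], z[0], position[0]],
        [x[1], y[1], z[1], position[1]],
        [x[2], y[2], z[2], position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Example: y_hint is not orthogonal to x_axis, so it is corrected to (0, 1, 0).
T = se3_from_axes([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.1, 0.2, 0.3])
```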