Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction

Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu

TL;DR

The paper tackles the challenge of translating high-level semantic self-reflection into fine-grained robotic action corrections. It introduces Phoenix, a motion-based self-reflection framework consisting of a dual-process motion adjustment system and a motion-conditioned diffusion policy, enabling robust, high-frequency action correction guided by MLLMs. A lifelong learning mechanism further enhances performance by refining the motion prediction model from interaction data while leveraging expert demonstrations to prevent forgetting. Experiments in RoboMimic simulation and real-world tasks demonstrate superior generalization, robustness to perceptual variations, and effective correction of failures, with a public code release.

Abstract

Building a generalizable self-correction system is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that give robots the ability to reflect semantically on failures, translating that semantic reflection into fine-grained robotic action correction remains a significant challenge. To address this gap, we build the Phoenix framework, which uses motion instructions as a bridge between high-level semantic reflection and low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism that uses MLLMs to translate semantic reflection into coarse-grained motion instruction adjustments. To use these motion instructions to guide fine-grained action correction, we propose a multi-task motion-conditioned diffusion policy that integrates visual observations for high-frequency robotic action correction. By combining these two models, we can shift the demand for generalization capability from the low-level manipulation policy to the MLLM-driven motion adjustment model and facilitate precise, fine-grained robotic action correction. Building on this framework, we further develop a lifelong learning method that automatically improves the model's capability through interactions with dynamic environments. Experiments in both RoboMimic simulation and real-world scenarios demonstrate the superior generalization and robustness of our framework across a variety of manipulation tasks. Our code is released at https://github.com/GeWu-Lab/Motion-based-Self-Reflection-Framework.
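The two-rate structure described in the abstract (a low-frequency motion instruction steering a high-frequency action policy, with a fast prediction path and a slower correction path) can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `DualProcessAdjuster`, `run_episode`, the `is_failing` check, and the `actions_per_instruction` parameter are all hypothetical names, and the MLLM modules and the diffusion policy are stood in for by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]
Action = List[float]


@dataclass
class DualProcessAdjuster:
    """Hypothetical stand-in for the dual-process motion adjustment mechanism."""

    predict: Callable[[Observation], str]            # fast path: motion prediction
    is_failing: Callable[[Observation, str], bool]   # assumed failure check
    correct: Callable[[Observation, str], str]       # slow path: motion correction

    def motion_instruction(self, obs: Observation) -> str:
        # Use the cheap prediction by default and fall back to the
        # correction path only when the prediction looks like a failure.
        motion = self.predict(obs)
        if self.is_failing(obs, motion):
            motion = self.correct(obs, motion)
        return motion


def run_episode(
    adjuster: DualProcessAdjuster,
    policy: Callable[[Observation, str], Action],    # stand-in for the diffusion policy
    step_env: Callable[[Observation, Action], Observation],
    obs: Observation,
    total_steps: int,
    actions_per_instruction: int,
) -> List[Tuple[str, Action]]:
    """Refresh the motion instruction at low frequency, act at every step."""
    log: List[Tuple[str, Action]] = []
    motion = ""
    for t in range(total_steps):
        if t % actions_per_instruction == 0:
            motion = adjuster.motion_instruction(obs)
        action = policy(obs, motion)  # action is conditioned on the current motion
        obs = step_env(obs, action)
        log.append((motion, action))
    return log
```

In this sketch the adjuster is queried once every `actions_per_instruction` steps, so a corrected instruction such as "move right" stays in force for several consecutive actions before the next reflection step.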

Paper Structure

This paper contains 19 sections, 1 equation, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our motion-based self-reflection framework utilizes coarse-grained motion instruction as a bridge to convert the high-level semantic reflection into fine-grained robotic action correction, thereby facilitating generalizable and precise action correction with perceptual and inferential capabilities of MLLMs.
  • Figure 2: The pipeline of our motion-based self-reflection framework. (a) demonstrates the dual-process motion refinement mechanism, which uses the motion prediction module for efficient motion instruction prediction and the motion correction module for comprehensive failure correction. (b) illustrates the motion-conditioned diffusion policy, which converts low-frequency motion instruction guidance into high-frequency robotic actions. The lifelong learning approach in (c) iteratively enhances the ability of the motion prediction module using the refined interaction trajectories.
  • Figure 3: Illustrations of our correction data: online human interventions, offline human annotations, and expert demonstrations.
  • Figure 4: The lifelong learning results. The results show that our motion-based self-reflection method can iteratively improve performance through interactions with environments.
  • Figure 5: In the color disruption setting, we replace the red block with a blue block in the Stack_D0 task, as shown in (a). In the position disruption setting, we change the position of the coffee machine in the Coffee_D0 task from a fixed point to a random position within the rectangle, as illustrated in (b). The results in (c) show that our framework generalizes well to these novel task settings.
  • ...and 1 more figure
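The lifelong learning step in Figure 2(c) refines the motion prediction module from interaction trajectories while expert demonstrations guard against forgetting. One common way to realize that kind of mixing is to draw each fine-tuning batch from both sources, as in the sketch below. This is an assumption for illustration only: the source does not state how the two data sources are combined, and `build_finetune_batch` and `expert_fraction` are hypothetical names.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_finetune_batch(
    refined: Sequence[T],
    expert: Sequence[T],
    batch_size: int,
    expert_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Sample a batch mixing refined interaction data with expert demonstrations.

    `expert_fraction` is an assumed hyperparameter: the share of each batch
    drawn from expert demonstrations to limit forgetting.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must be in [0, 1]")
    if not refined or not expert:
        raise ValueError("both data sources must be non-empty")
    n_expert = round(batch_size * expert_fraction)
    n_refined = batch_size - n_expert
    # Sample with replacement so small buffers can still fill a batch.
    batch = [rng.choice(expert) for _ in range(n_expert)]
    batch += [rng.choice(refined) for _ in range(n_refined)]
    rng.shuffle(batch)
    return batch
```

Keeping a fixed share of expert samples in every batch is a simple replay-style choice; other schedules (for example, decaying the expert share over rounds) would fit the same interface.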