Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models

Jinyi Liu; Yifu Yuan; Jianye Hao; Fei Ni; Lingzhi Fu; Yibin Chen; Yan Zheng

TL;DR

This work addresses visual policy learning for robot manipulation without task-specific priors by introducing CriticGPT, a fine-tuned multimodal LLM that analyzes trajectory videos and produces pairwise preferences. CriticGPT is trained on a large visual instruction-following dataset to achieve high preference accuracy, and its labels are used to fit a dense reward model via a pairwise-preference objective, which then drives DrQ-v2 policy learning. The approach outperforms representation-based rewards and sparse rewards on the Meta-World benchmark and generalizes to unseen tasks. The results suggest that AI-generated multimodal feedback can reduce human labeling costs while improving data efficiency and task success in robotic manipulation.

Abstract

Recently, there has been considerable attention towards leveraging large language models (LLMs) to enhance decision-making processes. However, aligning the natural language text instructions generated by LLMs with the vectorized operations required for execution presents a significant challenge, often necessitating task-specific details. To circumvent the need for such task-specific granularity, inspired by preference-based policy learning approaches, we investigate the utilization of multimodal LLMs to provide automated preference feedback solely from image inputs to guide decision-making. In this study, we train a multimodal LLM, termed CriticGPT, capable of understanding trajectory videos in robot manipulation tasks, serving as a critic to offer analysis and preference feedback. Subsequently, we validate the effectiveness of preference labels generated by CriticGPT from a reward modeling perspective. Experimental evaluation of the algorithm's preference accuracy demonstrates its effective generalization ability to new tasks. Furthermore, performance on Meta-World tasks reveals that CriticGPT's reward model efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
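
To make the pairwise-preference objective mentioned in the TL;DR more concrete, the sketch below shows a Bradley-Terry style cross-entropy loss of the kind commonly used in preference-based reinforcement learning. This section does not give the paper's exact objective, so the formulation, the function names `log_sigmoid` and `preference_loss`, and the tuple layout are assumptions for illustration only. Each pair holds the summed predicted reward of two trajectory segments and a label that is 1.0 when the critic prefers the first segment and 0.0 when it prefers the second.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(pairs):
    """Mean Bradley-Terry cross-entropy over labeled segment pairs.

    pairs: list of (return_a, return_b, label) tuples (hypothetical layout).
      return_a, return_b: summed predicted rewards of segments a and b.
      label: 1.0 if a is preferred, 0.0 if b is preferred.
    """
    total = 0.0
    for return_a, return_b, label in pairs:
        # Modeled probability that a is preferred: sigmoid(return_a - return_b).
        diff = return_a - return_b
        total -= label * log_sigmoid(diff) + (1.0 - label) * log_sigmoid(-diff)
    return total / len(pairs)


if __name__ == "__main__":
    # The reward model agrees with the label in the first pair and
    # disagrees in the second, so the second contributes a larger loss.
    print(preference_loss([(2.0, 0.0, 1.0), (0.0, 2.0, 1.0)]))
```

Minimizing this loss over the reward model's parameters pushes the preferred segment's summed reward above the other's. In a full pipeline the returns would come from a learned reward network evaluated on image observations rather than from fixed numbers.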

Paper Structure (16 sections, 3 equations, 5 figures, 1 table)

Introduction
Related Work
Multimodal Large Language Models
Pre-trained Vision-Language Representation Model
Large Language Models for Control Tasks
Reinforcement Learning from Preference Feedback
CriticGPT: A Multimodal LLM as a Critic
Visual Instruction-Following Dataset for Robot Manipulation
Adapting MLLM to Robot Manipulation Scenarios as a Critic
Policy Training
Experiment
Performance of Fine-Tuned CriticGPT
Collecting Dataset
Performance of CriticGPT
Efficiency of CriticGPT Preference Feedback for Policy Learning
...and 1 more section

Figures (5)

Figure 1: The architecture of CriticGPT, along with concise input-output examples. CriticGPT accommodates video input and, based on natural language instructions, generates corresponding responses.
Figure 2: An overview of using automated feedback labels generated by CriticGPT to facilitate policy learning. Without introducing ambiguity, details regarding collecting video data and the transition buffer are omitted.
Figure 3: Results of DrQ-v2 with different rewards on the Meta-World benchmark.
Figure 4: Illustrating differences in behavioral performance under various rewards using the coffee-button task as an example. Trajectories achieving success or near-success around 40k training steps are selected, and their initial, final, and intermediate frames are visualized.
Figure 5: Comparative analysis of the distribution differences in cumulative episode rewards obtained by different reward methods, with scattered points representing normalized cumulative reward values for different trajectories.
