Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
Jinyi Liu, Yifu Yuan, Jianye Hao, Fei Ni, Lingzhi Fu, Yibin Chen, Yan Zheng
TL;DR
This work addresses visual policy learning for robot manipulation without task-specific priors by introducing CriticGPT, a fine-tuned multimodal LLM that analyzes trajectory videos and produces pairwise preferences. CriticGPT is trained on a large visual instruction-following dataset to achieve high preference accuracy. Its preference labels are used to fit a dense reward model via a pairwise-preference objective, and that reward model then drives DrQ-v2 policy learning. On the Meta-World benchmark, the approach outperforms both representation-based rewards and sparse rewards, and its preference predictions generalize to unseen tasks. The results suggest that AI-generated multimodal feedback can reduce human labeling costs while improving data efficiency and task success in robotic manipulation.
Abstract
Leveraging large language models (LLMs) to enhance decision-making has recently attracted considerable attention. However, aligning the natural language instructions generated by LLMs with the vectorized operations required for execution is a significant challenge and often requires task-specific details. To avoid the need for such task-specific granularity, and inspired by preference-based policy learning, we investigate using multimodal LLMs to provide automated preference feedback from image inputs alone to guide decision-making. In this study, we train a multimodal LLM, termed CriticGPT, that understands trajectory videos in robot manipulation tasks and serves as a critic, offering analysis and preference feedback. We then validate the effectiveness of the preference labels generated by CriticGPT from a reward modeling perspective. Evaluation of its preference accuracy shows that CriticGPT generalizes effectively to new tasks. Furthermore, performance on Meta-World tasks shows that the reward model trained on CriticGPT's labels efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
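
To make the idea of fitting a reward model from pairwise preferences concrete, here is a minimal sketch. It assumes a Bradley-Terry style objective, which is a common choice in preference-based policy learning; the summary above does not spell out the exact loss, so the formulation and the function names (`segment_return`, `preference_prob`, `preference_loss`) are illustrative assumptions rather than the authors' implementation. Each trajectory segment is represented only by the per-step rewards that a reward model would predict for it.

```python
import math


def segment_return(rewards):
    """Sum of predicted per-step rewards over one trajectory segment."""
    return sum(rewards)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Assumption: preference depends only on the difference in summed
    predicted rewards, passed through a logistic function.
    """
    diff = segment_return(rewards_a) - segment_return(rewards_b)
    # Numerically stable logistic: avoid exponentiating large positive values.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between a preference label and the predicted preference.

    label is 1.0 if segment A was preferred and 0.0 if segment B was preferred
    (for example, a label produced by a critic model).
    """
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


if __name__ == "__main__":
    # Hypothetical predicted rewards for two short segments.
    seg_a = [0.8, 0.9, 1.0]
    seg_b = [0.1, 0.2, 0.1]
    print("P(A preferred):", preference_prob(seg_a, seg_b))
    print("loss if A labeled better:", preference_loss(seg_a, seg_b, 1.0))
    print("loss if B labeled better:", preference_loss(seg_a, seg_b, 0.0))
```

In a full pipeline, the per-step rewards would come from a learned network applied to image observations, and this loss would be minimized over many labeled segment pairs; the trained network would then supply dense rewards to the policy learner.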
