GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

Aoran Mei, Jianhua Wang, Guo-Niu Zhu, Zhongxue Gan

TL;DR

This work tackles robotic task planning with visual-language models by mitigating hallucinations and semantic complexity through a multi-agent framework called GameVLM. It integrates two GPT-4V-based decision agents, an expert evaluator, and a real-time open-vocabulary detector (YOLO-World) within a zero-sum game that rewards consistency and accuracy through a Q&A exchange. Experimental results on real robots show an average success rate of 83.3% across varied tasks, with particular strength in imitation and object stacking but weaker performance in predicting future actions. The approach advances robust, multimodal reasoning for robotic planning and demonstrates practical gains in dynamic, real-world environments, while highlighting areas for improvement in long-horizon prediction and planning.

Abstract

With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.
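
The zero-sum step described above can be illustrated with a small sketch. It is not the authors' implementation: the `Plan` type, the `toy_expert` scoring rule, and the helper names are hypothetical, and the stake of ten points is borrowed from the Gomoku illustration in Figure 1. The sketch only shows the defining property of a zero-sum exchange, namely that whatever one decision agent gains the other loses, with an expert score deciding the winner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    """A candidate task plan from one decision agent (hypothetical type)."""
    agent: str
    steps: list[str]


def zero_sum_round(score_a: float, score_b: float, stake: float = 10.0) -> tuple[float, float]:
    """Return payoffs (a, b) whose sum is always zero."""
    if score_a > score_b:
        return stake, -stake
    if score_b > score_a:
        return -stake, stake
    return 0.0, 0.0  # a tie transfers nothing


def toy_expert(detected: set[str]) -> Callable[[Plan], float]:
    """Assumed stand-in for the expert agent: the fraction of plan steps
    that mention at least one object the detector reported."""
    def score(plan: Plan) -> float:
        if not plan.steps:
            return 0.0
        hits = sum(any(obj in step for obj in detected) for step in plan.steps)
        return hits / len(plan.steps)
    return score


def select_plan(plan_a: Plan, plan_b: Plan, expert: Callable[[Plan], float]):
    """Score both plans, settle the zero-sum payoff, and keep the winner.
    On a tie, plan_a is kept by convention."""
    payoff_a, payoff_b = zero_sum_round(expert(plan_a), expert(plan_b))
    winner = plan_a if payoff_a >= payoff_b else plan_b
    return winner, (payoff_a, payoff_b)


# Usage: two hypothetical plans judged against a set of detected objects.
plan_a = Plan("agent_1", ["pick apple", "place apple on plate"])
plan_b = Plan("agent_2", ["pick banana", "place banana on plate"])
winner, payoffs = select_plan(plan_a, plan_b, toy_expert({"apple", "plate"}))
```

In this toy setup, a plan that refers to objects the detector did not find scores lower and loses the round, which mirrors the role the abstract assigns to the expert agent in resolving inconsistencies between decision agents.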

Paper Structure

This paper contains 15 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An example of zero-sum game theory and multi-agents: the game of Gomoku. Each player is an agent. The winning player gains ten points, while the losing player loses ten points. The total sum of their scores remains constant.
  • Figure 2: GameVLM overview. We propose a GameVLM framework, which comprises an input module, two decision agents, an expert agent, and an object detection module. The decision and expert agents refer to VLMs. Two decision agents are used to generate task plans and codes, while an expert agent checks the consistency of these codes. A real-time open-vocabulary object detection model, i.e., YOLO-World, is introduced to detect objects in the image.
  • Figure 3: Demonstration of the YOLO-World on real images. The model can identify apples in the image, regardless of whether they are labeled as "apple," "red apple," "red_apple," or "red fruit."
  • Figure 4: Overview of the system setup.
  • Figure 5: Example prompts in the GameVLM framework.
  • ...and 1 more figure