Multi-Agent Planning Using Visual Language Models

Michele Brienza; Francesco Argenziano; Vincenzo Suriani; Domenico D. Bloisi; Daniele Nardi

TL;DR

The paper tackles the challenge of planning with Visual Language Models (VLMs) in unstructured environments by proposing a hierarchical, multi-agent framework that grounds planning in a single environmental image. It distributes perception and reasoning across three agents (SKM, GKM, Planner) and leverages GPT-4V for image understanding with GPT-4 for planning, aiming to reduce hallucinations and context size. A novel PG2S metric evaluates plan quality semantically, using both sentence-level and goal-level similarities and combining them via a weighted rule, without requiring execution. Empirical validation on the ALFRED dataset shows that the single-image, multi-agent approach outperforms table-based perception and single-agent baselines, and that PG2S provides a robust semantic measure of plan quality with improved sensitivity to meaningful variations in plan structure. Overall, the work advances embodied task planning by minimizing reliance on structured environment representations and introducing a semantic evaluation framework that better captures planning correctness and safety implications.
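The weighted combination of sentence-level and goal-level similarity mentioned above can be illustrated with a minimal Python sketch. This is not the PG2S formula, which this summary does not state. The convex combination, the `weight` parameter, the best-match averaging over plan steps, and the token-overlap (Jaccard) similarity standing in for a real semantic similarity model are all assumptions made for illustration, as are the function names.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for a semantic model."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sentence_level(pred_steps, ref_steps, sim=jaccard):
    """Average, over reference steps, of the best-matching predicted step."""
    if not pred_steps or not ref_steps:
        return 0.0
    best = [max(sim(ref, pred) for pred in pred_steps) for ref in ref_steps]
    return sum(best) / len(best)


def goal_level(pred_goal, ref_goal, sim=jaccard):
    """Similarity between two goal descriptions."""
    return sim(pred_goal, ref_goal)


def combined_score(pred_steps, ref_steps, pred_goal, ref_goal, weight=0.5):
    """Hypothetical weighted rule: weight * sentence + (1 - weight) * goal."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    s = sentence_level(pred_steps, ref_steps)
    g = goal_level(pred_goal, ref_goal)
    return weight * s + (1.0 - weight) * g
```

Because the score is computed from text alone, a generated plan can be compared with a reference plan without executing either one, which is the property the summary attributes to PG2S. Swapping `jaccard` for an embedding-based similarity would change the numbers but not the structure of the sketch.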

Abstract

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

Paper Structure (11 sections, 3 equations, 3 figures, 3 tables, 1 algorithm)

Related Work
LLMs as Planners
Multi-agent Prompting
Methodology
Multi-agent Planning
Evaluation
Experimental Results
Evaluation of our PG2S Metric
Evaluation of our Architecture
Discussion
Conclusion

Figures (3)

Figure 1: Overall view of the proposed framework. Given a task description and an image of the scene, the plan is obtained with multi-agent planning and assessed with the new score.
Figure 2: Complete and detailed architecture of the proposed method. The task description and the image are given as input to the agents, which extract meaningful information from the scene. Their output is then processed by the planner agent, which produces the final plan. This plan is then compared with the ground truth and evaluated according to our new metric, which takes semantically meaningful information into account.
Figure 3: One of the scenes used for the experimental tests. A screenshot from AI2Thor is used to perform the planning.
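
The data flow described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `plan_from_image`, the bracketed prompt strings, and the line-per-step plan format are hypothetical, and the vision and text models are passed in as plain callables rather than tied to any particular API. What each perception agent actually extracts is not specified here, so the prompts are placeholders.

```python
from typing import Callable, List

# Assumed interfaces: a vision model maps (prompt, image bytes) to text,
# and a text-only model maps a prompt to text.
VisionModel = Callable[[str, bytes], str]
TextModel = Callable[[str], str]


def plan_from_image(
    task: str,
    image: bytes,
    vision_model: VisionModel,
    text_model: TextModel,
) -> List[str]:
    """Run two perception agents on one image, then a text-only planner."""
    # Both perception agents see the same task description and image.
    # The prompt wording is a placeholder, not taken from the paper.
    skm_output = vision_model(f"[SKM placeholder prompt] Task: {task}", image)
    gkm_output = vision_model(f"[GKM placeholder prompt] Task: {task}", image)

    # The planner receives only text: the task plus both agents' outputs.
    planner_prompt = "\n".join(
        [
            f"Task: {task}",
            f"SKM output: {skm_output}",
            f"GKM output: {gkm_output}",
            "Write the plan as one step per line.",
        ]
    )
    raw_plan = text_model(planner_prompt)

    # Assumed plan format: one step per non-empty line.
    return [line.strip() for line in raw_plan.splitlines() if line.strip()]
```

Passing the models in as callables keeps the sketch independent of any specific model provider and makes it easy to test with stub functions. It also mirrors the split described in the TL;DR, where image understanding and planning are handled by different models.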
