EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Lu Qiu; Yi Chen; Yuying Ge; Yixiao Ge; Ying Shan; Xihui Liu

TL;DR

EgoPlan-Bench2 targets planning, a critical but underexplored capability of Multimodal LLMs (MLLMs). It evaluates 21 models on 1,321 next-action multiple-choice QA pairs derived from 1,113 egocentric Ego4D videos, covering 24 scenarios in 4 real-world domains. The authors build the benchmark with a semi-automatic, three-stage data-collection pipeline (task-goal extraction, multiple-choice QA generation, and model and human verification) that emphasizes long-horizon task progress and evolving observations. Current MLLMs struggle with planning: GPT-4V performs best but still reaches only about 32.6%, and the authors analyze domain- and modality-specific failure modes. As a key contribution, they propose a training-free multimodal Chain-of-Thought prompting approach that uses historical task sequences and visual prompts (e.g., bounding boxes) to improve planning, boosting GPT-4V performance by up to 10.24 percentage points and reaching 43.04% with self-consistency. The dataset, analysis, and prompting framework provide a practical basis for evaluating and improving planning in video-based multimodal systems aimed at real-world assistance.
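The scoring idea mentioned above (multiple-choice accuracy, optionally with self-consistency) can be illustrated with a minimal sketch. It assumes self-consistency means sampling several answers per question and keeping the most frequent one; the function names and toy data are hypothetical and are not taken from the authors' released code.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent answer letter among the sampled answers.

    Ties are resolved in favour of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    return Counter(samples).most_common(1)[0][0]


def accuracy(predictions, answers):
    """Percentage of questions whose predicted option matches the gold option."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: three sampled answers for each of four questions.
sampled = [
    ["A", "B", "A"],
    ["C", "C", "D"],
    ["B", "D", "D"],
    ["A", "A", "A"],
]
gold = ["A", "C", "B", "B"]

voted = [majority_vote(s) for s in sampled]
print(voted)                  # ['A', 'C', 'D', 'A']
print(accuracy(voted, gold))  # 50.0
```

In this toy run two of the four voted answers match the gold options, so the accuracy is 50.0. Under this reading, a gain reported in percentage points is simply the difference between two such accuracy values.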

Abstract

The advent of Multimodal Large Language Models (MLLMs), leveraging the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence (AGI). However, achieving AGI requires more than comprehension and reasoning. A crucial capability is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. It is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we propose a training-free approach using multimodal Chain-of-Thought (CoT) prompting, developed by investigating the effectiveness of various multimodal prompts in complex planning. Our approach improves the performance of GPT-4V by 10.24 percentage points on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. We have made data and code available at https://qiulu66.github.io/egoplanbench2/.
