EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu

TL;DR

EgoPlan-Bench2 targets planning, a critical but underexplored capability of Multimodal LLMs (MLLMs), by evaluating 21 models on 1,321 next-action multiple-choice QA pairs derived from 1,113 egocentric Ego4D videos across 24 scenarios in 4 real-world domains. The authors design a semi-automatic data-collection pipeline with three stages (task-goal extraction, multiple-choice QA generation, and model and human verification), yielding a benchmark that emphasizes long-horizon task progress and evolving observations. They show that current MLLMs struggle with planning: GPT-4V performs best but still reaches only modest performance (~32.6%). They also analyze domain- and modality-specific failure modes. As a key contribution, they propose a training-free multimodal Chain-of-Thought prompting approach that leverages historical task sequences and visual prompts (e.g., bounding boxes) to improve planning, boosting GPT-4V performance by up to 10.24 percentage points and achieving 43.04% with self-consistency. The dataset, analysis, and prompting framework offer a practical step toward human-level planning in real-world assistance and a reference point for evaluating planning in video-based multimodal systems.

Abstract

The advent of Multimodal Large Language Models, leveraging the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence. However, achieving AGI necessitates more than just comprehension and reasoning. A crucial capability required is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. EgoPlan-Bench2 is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we propose a training-free approach using multimodal Chain-of-Thought (CoT) prompting through investigating the effectiveness of various multimodal prompts in complex planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. We have made data and code available at https://qiulu66.github.io/egoplanbench2/.
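The evaluation summarized above reduces to scoring multiple-choice answers, and the TL;DR mentions self-consistency. The sketch below is only an illustration under stated assumptions: it assumes each model answer is a single option letter, that performance is reported as accuracy in percent, and that self-consistency means a majority vote over several sampled answers (the general technique; the paper's exact procedure may differ). The function names `majority_vote` and `accuracy` are hypothetical and do not come from the released code.

```python
from collections import Counter
from typing import Sequence


def majority_vote(sampled_answers: Sequence[str]) -> str:
    """Return the most frequent option letter among sampled answers.

    Assumption: self-consistency is a plain majority vote. Ties are resolved
    in favor of the answer that was seen first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(sampled_answers).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted option is correct."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)
```

In use, each question would contribute one voted answer, for example `majority_vote(["A", "C", "A"])`, and the list of voted answers would then be passed to `accuracy` together with the reference options.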

Paper Structure

This paper contains 29 sections, 1 equation, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Left: EgoPlan-Bench2 encompasses planning tasks spanning four major domains and 24 detailed scenarios for evaluating the planning capabilities of MLLMs in diverse real-world contexts. Right: Examples of our multiple-choice question-answer pairs, where a partial video showing historical task progress, a current observation image, and a task goal expressed in language are given for a model to select the most appropriate action.
  • Figure 2: The overview of the semi-automatic dataset construction pipeline for EgoPlan-Bench2. Stage I: Task Goal Extraction, where task goals are summarized from video narrations by GPT-4 with a hierarchical extraction and decomposition strategy, and are further filtered to eliminate overly complex tasks. Stage II: Multiple-choice QA Generation, where multiple-choice questions are generated based on the task goals and corresponding action sequences using predefined templates. Foundation models are utilized to select an appropriate image as the visual observation (i.e., the end of the video showing task progress). Stage III: Model and Human Verification, where model verification is conducted to reinforce the multimodal evaluation capability, and human annotators are employed to guarantee the reliability and objectivity of EgoPlan-Bench2.
  • Figure 3: The pipeline of the adaptive observation selection method. Several frames around the timestamp of the ground-truth action are cropped as candidate frames. GPT-4 and InternVL-1.5 are then employed to verify whether each candidate frame is qualified. In this example, the selected candidate frame contains all objects necessary for the next action, fulfilling the second criterion. However, since InternVL-1.5 can correctly predict the upcoming action without historical task progress information, this frame fails to meet the first criterion and should therefore be discarded.
  • Figure 4: Left: Scenario distribution of EgoPlan-Bench2, which covers 4 major domains and 24 fine-grained scenarios. Right: Video length distribution. Our benchmark covers a full spectrum of video durations, ranging from a few seconds to five minutes.
  • Figure 5: Word clouds of task goals and candidate options in EgoPlan-Bench2. From left to right: verbs in task goals, objects in task goals, verbs in candidate options, objects in candidate options.
  • ...and 9 more figures
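
The two criteria in the Figure 3 caption can be read as a simple filter over candidate frames: a frame qualifies only if it shows every object needed for the next action and a model cannot already guess that action from the frame alone, without the task history. The sketch below illustrates that logic only. `CandidateFrame`, `is_qualified`, `select_observation`, and the `predict_without_history` callable are hypothetical names, and representing visible objects as a set of labels is an assumption made for the example; the paper relies on GPT-4 and InternVL-1.5 for these checks.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class CandidateFrame:
    timestamp: float                 # seconds into the video
    visible_objects: FrozenSet[str]  # assumed output of some object check


def is_qualified(
    frame: CandidateFrame,
    required_objects: Set[str],
    ground_truth_action: str,
    predict_without_history: Callable[[CandidateFrame], str],
) -> bool:
    """Apply both criteria from the Figure 3 caption to one frame."""
    # Criterion 1: the action must NOT be predictable from the frame alone,
    # so that the question still depends on the historical task progress.
    guessable = predict_without_history(frame) == ground_truth_action
    # Criterion 2: every object needed for the next action is visible.
    has_objects = required_objects <= frame.visible_objects
    return has_objects and not guessable


def select_observation(
    frames: Iterable[CandidateFrame],
    required_objects: Set[str],
    ground_truth_action: str,
    predict_without_history: Callable[[CandidateFrame], str],
) -> Optional[CandidateFrame]:
    """Return the first qualified frame, or None if every frame is discarded."""
    for frame in frames:
        if is_qualified(frame, required_objects, ground_truth_action,
                        predict_without_history):
            return frame
    return None
```

Returning the first qualified frame is a design choice made for brevity; the source does not say how ties among several qualified frames are resolved.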