ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing

TL;DR

ECBench introduces a holistic embodied cognition benchmark for LVLMs operating on egocentric RGB-D video, addressing gaps in existing benchmarks by evaluating static, dynamic, and hallucination dimensions across 30 cognitive abilities. It combines a carefully curated video collection with a category-independent QA annotation strategy and a novel ECEval scoring framework that blends binary and multi-level assessments using 0.5-point references for open-ended items. Empirical results show current LVLMs struggle with dynamic scenes, robot-centric understanding, and embodied hallucinations, highlighting the need for improved temporal grounding and self-awareness in embodied agents. By providing a rigorous, open-world evaluation platform, ECBench and ECEval aim to drive development of more reliable core models for embodied robotic cognition and interaction.
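
The mix of binary and multi-level scoring described above can be illustrated with a minimal sketch. This is not the ECEval implementation: the names `QAItem`, `score_item`, and `mean_score` are hypothetical, and normalized exact string matching stands in for whatever answer-matching procedure ECEval actually uses. The only details taken from the summary are that closed-ended items are scored as right or wrong, and that open-ended items carry an additional 0.5-point reference answer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QAItem:
    """Hypothetical container for one benchmark question (not the ECBench schema)."""
    question: str
    full_credit_answer: str
    open_ended: bool = False
    # Assumption: only open-ended items carry a 0.5-point reference answer.
    half_credit_answer: Optional[str] = None


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not matter.
    return " ".join(text.lower().split())


def _matches(prediction: str, reference: str) -> bool:
    # Placeholder matcher: exact match after normalization.
    # A real evaluator would need a far more tolerant comparison.
    return _normalize(prediction) == _normalize(reference)


def score_item(item: QAItem, prediction: str) -> float:
    """Return 1.0, 0.5, or 0.0 for a single prediction."""
    if _matches(prediction, item.full_credit_answer):
        return 1.0
    # The intermediate level is available only for open-ended items.
    if item.open_ended and item.half_credit_answer is not None:
        if _matches(prediction, item.half_credit_answer):
            return 0.5
    return 0.0


def mean_score(items: Sequence[QAItem], predictions: Sequence[str]) -> float:
    """Average per-item scores over a set of questions."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    return sum(score_item(i, p) for i, p in zip(items, predictions)) / len(items)


# Made-up example data, for illustration only.
closed = QAItem("Is the door open?", "yes")
opened = QAItem(
    "What is on the table?",
    "a red mug and a laptop",
    open_ended=True,
    half_credit_answer="a mug",
)
```

In this sketch a closed-ended item can only score 0 or 1, while an open-ended item can also earn 0.5 when the prediction matches the partial-credit reference, so a partially correct open-ended answer is not collapsed to zero.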

Abstract

The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at https://github.com/Rh-Dang/ECBench.

Paper Structure (46 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Illustration of the question answering (QA) format and representative cognitive dimensions from ECBench. The benchmark comprises 386 RGB-D videos, 4,324 QA pairs, and 30 distinct embodied cognitive abilities, spanning aspects such as perception, reasoning, self-awareness, dynamic capturing, and hallucination. ECEval employs distinct evaluation methods for different types of answers.
  • Figure 2: Overview of embodied cognition dimensions in ECBench. ECBench includes three subsets: static scenes, dynamic scenes, and hallucination, evaluating a total of 30 embodied cognitive abilities.
  • Figure 3: Data analysis of ECBench reflects a rich diversity of scenario categories, video sources, and evaluation dimensions.
  • Figure 4: Comparison with OpenEQA on textual data, including the distribution of question lengths, average question length, maximum question length, vocabulary size, number of questions, and number of capabilities.
  • Figure 5: Comparison of results between ECEval, Binary Scoring, and Multilevel Scoring, for open-ended and closed-ended questions. Notably, only open-ended questions are annotated with 0.5-point answers.
  • ...and 8 more figures