Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu

TL;DR

CFG-Bench addresses the gap in evaluating embodied agents' fine-grained action intelligence by proposing a four-tier cognitive taxonomy and a dataset of 1,368 videos with 19,562 QA pairs across 11 tasks. It combines closed-ended multiple-choice questions (MCQs) with open-ended, GPT-assisted evaluation and counterfactual challenges to ground physical actions and intentions in visual evidence, and it includes an explicit gating mechanism for false-premise queries. Experiments show that current MLLMs struggle with fine-grained action instructions and higher-order reasoning, but supervised fine-tuning on CFG-Bench yields meaningful improvements on downstream embodied tasks. The work highlights the importance of cognitive grounding for embodied AI and provides a scalable framework for evaluating and improving grounded action understanding in vision-language models.

Abstract

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modality question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents. Project page: https://cfg-bench.github.io/

Paper Structure

This paper contains 31 sections, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Illustration of CFG-Bench's focus on embodied intelligence over descriptive accuracy. The top part shows how FAVOR-Bench annotates and questions from a third-person perspective, a task which current MLLMs can often solve. In contrast, the bottom part demonstrates CFG-Bench's fine-grained annotation and first-person scenario questions, which probe for the actionable physical and intentional details necessary for embodied agents. Current MLLMs struggle to master the crucial fine-grained details required for physical interaction.
  • Figure 2: Task demonstration of CFG-Bench. Note: all QA pairs, including those above, are slightly simplified for clarity and brevity.
  • Figure 3: Data statistics of CFG-Bench. (a) Distribution and video length statistics of the five datasets. (b) The distribution of tasks across four tiers. AW denotes the average number of words per question.
  • Figure 4: Pipeline of dataset generation. Both annotation and QA generation are human-AI collaborative workflows. Open-ended and closed-ended questions share the same pipeline in the early stages.
  • Figure 5: The qualitative analysis of QA Forms.
  • ...and 8 more figures