VisualActBench: Can VLMs See and Act like a Human?

Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo

TL;DR

The paper tackles the gap in vision-language models' ability to reason and act autonomously from visual input, introducing Visual Action Reasoning and the VisualActBench benchmark. VisualActBench captures proactive decision-making across four real-world scenarios with actions annotated by Action Prioritization Level and proactiveness, enabling evaluation of both correctness and value alignment. Large-scale models like GPT-4o perform best but still fall short of human-level proactive reasoning, with substantial room for improvement in temporal grounding and outcome anticipation. The work provides a critical benchmark and analysis showing where current VLMs fail and how reinforcement learning and model scale can yield improvements, guiding future development of real-world, vision-centric agents.

Abstract

Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT-4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: For humans, seeing a cluttered room naturally prompts the intention to tidy it up. VLMs, given no explicit instruction such as "tidy up the room," have to infer the relevant action from visual cues alone, relying solely on their own capabilities.
  • Figure 2: Examples from VisualActBench, showcasing diverse real-world scenarios and the corresponding proactive actions with varying Action Prioritization Levels (APL). Each frame includes a proposed action, its associated APL, and whether the action is considered proactive.
  • Figure 3: Distribution of videos and actions in VisualActBench. The left two charts show the number of videos categorized by scenario type (Dynamic Navigation, Home Service, Safety and Monitoring, and Human-Machine Interaction) and by Action Prioritization Level (APL 0–4). The right two charts illustrate the number of actions categorized by proactivity (Proactive vs. Reactive) and by APL.
  • Figure 4: Normalized proactive ratios of various VLMs, reflecting their overall inclination to generate proactive rather than reactive actions across diverse scenarios. The red dashed line denotes the average proactive ratio across all evaluated models, serving as a reference for assessing the relative proactiveness and behavioral consistency of each system.