Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

TL;DR

Embodied4C introduces a closed-loop, multi-embodiment benchmark that jointly evaluates vision-language reasoning and control across autonomous driving, aerial navigation, and robotic manipulation. By separating VQA (scenario understanding) from VLN (control) and probing semantic, spatial, temporal, and physical reasoning under domain shifts, the framework reveals that cross-modal alignment and instruction tuning are crucial for embodied competence, while spatial and temporal grounding remains the main bottleneck. The study benchmarks ten foundation models and four domain-specialized baselines, finding that generalist, well-grounded models (e.g., the GPT-5 family) achieve stronger cross-embodiment performance and generalization than scale alone would predict, whereas domain-specialized agents struggle to generalize beyond their native priors. These findings advocate for persistent world modeling, robust cross-domain grounding, and prompt-driven generalization as routes to general-purpose embodied agents.

Abstract

Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remains the primary bottleneck for reliable embodied competence.

Paper Structure

This paper contains 49 sections, 12 equations, 7 figures, and 18 tables.

Figures (7)

  • Figure 1: Overview of the Embodied4C benchmark. Embodied4C spans three embodiment domains (autonomous driving, aerial navigation, robotic manipulation) and four embodied scenario understanding capabilities (semantic, spatial, temporal, physical). The benchmark evaluates autonomous agents in visual question answering and vision-language navigation for scenario understanding and control execution, and includes domain-far questions to probe generalization (i.e., sensor set, general knowledge, weather, scenario). This design targets core value propositions of vision-language models as embodied agents: generality, multi-embodiment competence, and natural interaction.
  • Figure 2: Qualitative examples of Embodied4C VQA across diverse benchmark scenarios and domains. The figure illustrates typical success and failure patterns for semantic, spatial, temporal, physical, and general reasoning across all tested models.
  • Figure 3: Distributions for VQA questions and VLN instructions across all three sub-benchmarks and Embodied4C.
  • Figure 4: Illustrative overview of robotic manipulator configurations for the Franka Panda, Sawyer, and UR5 arms.
  • Figure 5: Per-model performance across fine-grained sub-capabilities in the Embodied4C benchmark. Each cell shows the mean accuracy (%) for a model (rows) on a sub-capability (columns). Sub-capabilities are grouped into main capabilities (separated by thick vertical lines): semantic (CL, AT, ST), spatial (LC, DT, OR, TP, CT), temporal (R/S, R/P, R/D), and physical (M/T, S/M, L/M, MD, CS, M/P, E/E) reasoning. The final column (GEN) reports performance on general knowledge questions.
  • ...and 2 more figures
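
The grouping described in the Figure 5 caption can be illustrated with a small sketch. It computes mean accuracy (%) per sub-capability from graded answers and then averages the sub-capabilities within each main capability. The group memberships are taken from the caption; the function names, the record format, and the unweighted averaging within a group are assumptions made for illustration, not the paper's scoring code.

```python
from statistics import mean

# Sub-capability codes per main capability, as listed in the Figure 5 caption.
CAPABILITY_GROUPS = {
    "semantic": ["CL", "AT", "ST"],
    "spatial": ["LC", "DT", "OR", "TP", "CT"],
    "temporal": ["R/S", "R/P", "R/D"],
    "physical": ["M/T", "S/M", "L/M", "MD", "CS", "M/P", "E/E"],
    "general": ["GEN"],
}


def per_subcapability_accuracy(records):
    """Return accuracy (%) per sub-capability.

    `records` is a hypothetical format: an iterable of
    (sub_capability_code, is_correct) pairs for one model.
    """
    totals, correct = {}, {}
    for sub, ok in records:
        totals[sub] = totals.get(sub, 0) + 1
        correct[sub] = correct.get(sub, 0) + (1 if ok else 0)
    return {sub: 100.0 * correct[sub] / totals[sub] for sub in totals}


def group_scores(sub_acc):
    """Average sub-capability accuracies within each main capability.

    Assumption: an unweighted mean over the sub-capabilities that have
    at least one graded answer; groups with no data are omitted.
    """
    scores = {}
    for capability, subs in CAPABILITY_GROUPS.items():
        values = [sub_acc[s] for s in subs if s in sub_acc]
        if values:
            scores[capability] = mean(values)
    return scores


if __name__ == "__main__":
    # Made-up graded answers, for illustration only.
    demo = [("CL", True), ("CL", False), ("AT", True), ("LC", False), ("GEN", True)]
    print(group_scores(per_subcapability_accuracy(demo)))
```

An unweighted mean treats every sub-capability equally regardless of how many questions it contains; weighting by question count would be an equally plausible choice if the sub-capabilities differ greatly in size.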