Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
TL;DR
Embodied4C introduces a closed-loop, multi-embodiment benchmark that jointly evaluates vision-language reasoning and control across autonomous driving, aerial navigation, and robotic manipulation. By separating visual question answering (VQA, scenario understanding) from vision-language navigation (VLN, control) and probing semantic, spatial, temporal, and physical reasoning under domain shifts, the framework reveals that cross-modal alignment and instruction tuning are crucial for embodied competence, while spatial and temporal grounding remains the main bottleneck. The study benchmarks ten foundation models and four domain-specialized baselines, finding that generalist, well-grounded models (e.g., the GPT-5 family) achieve stronger cross-embodiment performance and generalization than scale alone would predict, whereas domain-specialized agents struggle to generalize beyond their native priors. These findings argue for persistent world modeling, robust cross-domain grounding, and prompt-driven generalization as routes toward general-purpose embodied agents.
Abstract
Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remains the primary bottleneck for reliable embodied competence.
