RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, Tianshui Chen, Keze Wang, Liang Lin, Guangrun Wang

TL;DR

RADAR targets a fundamental mismatch in Vision-Language-Action benchmarks by focusing on real-world dynamics, explicit spatial-physical reasoning, and autonomous 3D evaluation. It introduces a centralized RADAR pipeline with a robot arm, wrist camera, stereo vision, and an actuated platform to enable 24/7 autonomous testing, while modeling environmental, agent-centric, and semantic perturbations, plus latency. The benchmark enforces 3D outcomes via volumetric IoU and 6-DoF actions, and evaluates robustness through four task splits and varied distractors, revealing strong fragility in current VLA models under realistic dynamics. The work demonstrates that state-of-the-art VLA models overfit to static or 2D cues and struggle with genuine 3D spatial reasoning and language grounding, underscoring the need for robust, generalizable embodied intelligence in real-world settings.

Abstract

Vision-Language-Action (VLA) models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, instead relying either on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with the expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark for reliable and generalizable real-world evaluation of VLA models.
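To make the 3D IoU figure above more concrete, here is a minimal sketch of a volumetric (voxel-level) intersection-over-union. It is not the authors' implementation: the function names (`voxelize`, `volumetric_iou`, `box_voxels`), the set-based occupancy representation, the voxel size, and the convention of returning 0.0 when both shapes are empty are all assumptions made for illustration.

```python
import math
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def voxelize(points: Iterable[Tuple[float, float, float]], voxel_size: float) -> Set[Voxel]:
    """Map 3D points to the set of integer voxel indices they fall into.

    Hypothetical helper: the source does not specify a voxelization scheme.
    """
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    return {
        (math.floor(x / voxel_size), math.floor(y / voxel_size), math.floor(z / voxel_size))
        for x, y, z in points
    }


def volumetric_iou(current: Set[Voxel], target: Set[Voxel]) -> float:
    """Intersection-over-union of two occupied-voxel sets.

    Returns 0.0 when both sets are empty (an assumed convention).
    """
    union = current | target
    if not union:
        return 0.0
    return len(current & target) / len(union)


def box_voxels(origin: Voxel, size: Voxel) -> Set[Voxel]:
    """Occupied voxels of an axis-aligned box, used only to build toy inputs."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return {
        (ox + i, oy + j, oz + k)
        for i in range(sx)
        for j in range(sy)
        for k in range(sz)
    }


if __name__ == "__main__":
    target = box_voxels((0, 0, 0), (4, 4, 4))   # desired object placement
    current = box_voxels((2, 0, 0), (4, 4, 4))  # object shifted by half its width
    print(f"IoU = {volumetric_iou(current, target):.3f}")
```

In this toy case, a cube displaced by half its width along one axis overlaps its target in 32 of 96 occupied voxels, giving an IoU of 1/3. For scale, the drop reported in the abstract, from 0.261 to 0.068, is a relative decrease of roughly 74%.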

Paper Structure (53 sections, 6 equations, 11 figures, 3 tables, 1 algorithm)

Figures (11)

  • Figure 1: A comparison of benchmarking paradigms. We categorize existing VLA evaluation into Simulation (scalable but lacking physical dynamics) and constrained Real-World setups (realistic but static, lacking spatial consideration, and often relying on manual evaluation). RADAR (right) bridges these gaps with a unified benchmark that provides Real-World Dynamics via systematically modeled dynamics, Spatial Reasoning via geometric tasks, and fully Autonomous Evaluation using high-precision 3D metrics.
  • Figure 2: A comparison between current VLA benchmarks and the proposed RADAR benchmark. Existing benchmarks often rely on (1) static, simplified environments, (2) tasks that require only intuitive perception, and (3) manual or limited 2D evaluation. In contrast, RADAR introduces (1) complex real-world dynamics (e.g., lighting, backgrounds), (2) tasks designed for enhanced spatial-physical reasoning, and (3) a fully automated, 3D-based evaluation loop for infinite scaling.
  • Figure 3: The RADAR Pipeline. RADAR operates as a centralized platform utilizing a client-server-worker architecture. Users submit requests to the central server, which manages authorization, task queuing, and scheduling strategies. The server then dispatches tasks to workers for execution (such as controlling robots or processing video), with the final evaluation results fed back to the user.
  • Figure 4: RADAR Interface Protocols. A detailed view of the communication interfaces between the Client, Server, and Worker components. The Client initiates evaluation requests (Begin_eval, Action, Reset), which are routed to the Worker. The Server coordinates resource allocation via Get_Worker and monitors system health through Heartbeat signals from workers.
  • Figure 5: Evaluation Metric Computation via 3D Reconstruction. (a) The system captures Multi-view RGB-D data from the external stereo cameras to create a comprehensive 4D visual representation of the workspace. (b) 3D Reconstruction and Evaluation. The system generates a dense 3D voxel reconstruction of the scene to compare the current object state against the target state.
  • ...and 6 more figures
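The client-server-worker interaction described in the captions for Figures 3 and 4 can be sketched as a small in-memory model. Only the message names (Begin_eval, Action, Reset, Get_Worker, Heartbeat) come from the captions; the class names, the heartbeat timeout, the first-available allocation rule, and the return values are hypothetical and are not taken from the RADAR implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional

HEARTBEAT_TIMEOUT_S = 5.0  # assumed value; the source does not state a timeout


class Request(Enum):
    """Client requests named in the Figure 4 caption."""
    BEGIN_EVAL = "Begin_eval"
    ACTION = "Action"
    RESET = "Reset"


@dataclass
class WorkerRecord:
    worker_id: str
    last_heartbeat: float
    busy: bool = False


class Server:
    """Tracks worker health via heartbeats and hands out idle workers."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.workers: Dict[str, WorkerRecord] = {}

    def heartbeat(self, worker_id: str, now: float) -> None:
        """Register a worker or refresh its last-seen time (Heartbeat)."""
        record = self.workers.get(worker_id)
        if record is None:
            self.workers[worker_id] = WorkerRecord(worker_id, now)
        else:
            record.last_heartbeat = now

    def get_worker(self, now: float) -> Optional[str]:
        """Return the first idle worker with a fresh heartbeat (Get_Worker)."""
        for record in self.workers.values():
            alive = now - record.last_heartbeat <= self.timeout_s
            if alive and not record.busy:
                record.busy = True
                return record.worker_id
        return None

    def release(self, worker_id: str) -> None:
        """Mark a worker as idle again once an evaluation ends."""
        self.workers[worker_id].busy = False


class Worker:
    """Toy worker that only counts actions instead of driving a robot."""

    def __init__(self, worker_id: str) -> None:
        self.worker_id = worker_id
        self.active = False
        self.steps = 0

    def handle(self, request: Request, payload: object = None) -> str:
        if request is Request.BEGIN_EVAL:
            self.active, self.steps = True, 0
            return "started"
        if request is Request.ACTION:
            if not self.active:
                raise RuntimeError("Action received before Begin_eval")
            self.steps += 1  # a real worker would execute the payload here
            return f"step {self.steps}"
        if request is Request.RESET:
            self.active, self.steps = False, 0
            return "reset"
        raise ValueError(f"unknown request: {request}")


if __name__ == "__main__":
    server = Server()
    server.heartbeat("w0", now=0.0)
    assigned = server.get_worker(now=1.0)
    worker = Worker(assigned)
    print(worker.handle(Request.BEGIN_EVAL))
    print(worker.handle(Request.ACTION, payload=[0.0] * 6))  # 6-DoF action placeholder
    print(worker.handle(Request.RESET))
    server.release(assigned)
```

Time is passed in explicitly rather than read from a clock so that the allocation logic stays deterministic and easy to test; a deployed system would presumably use real timestamps and a network transport.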