IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao, Zheda Mai, Daeun Lee, Zichen Chen, Zhen Tan, Mohit Bansal, Yu Kong

TL;DR

IndustryNav addresses the lack of benchmarks for evaluating embodied VLLMs' spatial reasoning in dynamic industrial settings. It introduces a Unity-based benchmark of 12 dynamic warehouse scenes and a zero-shot PointGoal navigation pipeline that fuses egocentric perception with global odometry. Five metrics (Success Ratio, Distance Ratio, Average Steps, Collision Ratio, and Warning Ratio), together with a delta threshold and a 1 m safety margin, jointly assess task success, efficiency, and safety. Nine VLLMs (five closed-source and four open-source) are evaluated: closed-source models generally outperform open-source ones, but all struggle with robust long-horizon planning, collision avoidance, and active exploration, which makes safety a major bottleneck. Ablation studies show that action-state histories meaningfully improve performance, while top-down map inputs provide limited benefit.

Abstract

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance under dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.
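
To make the role of global odometry in a PointGoal setting more concrete, the following is a minimal sketch of how an agent pose and a goal position could be turned into a goal-relative distance and bearing. It is an illustration only, not the paper's pipeline: the `Pose` layout, the 2D coordinate convention (heading measured counter-clockwise from the +x axis), the function names, and the default `delta` value are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Agent pose from global odometry (hypothetical layout, not from the paper)."""
    x: float        # metres, world frame
    y: float        # metres, world frame
    yaw_deg: float  # heading in degrees, counter-clockwise from the +x axis


def goal_relative(pose: Pose, goal_xy: tuple[float, float]) -> tuple[float, float]:
    """Return (distance in metres, signed bearing in degrees) from the agent to the goal.

    A positive bearing means the goal lies to the agent's left under the
    assumed counter-clockwise convention.
    """
    dx = goal_xy[0] - pose.x
    dy = goal_xy[1] - pose.y
    distance = math.hypot(dx, dy)
    world_angle = math.degrees(math.atan2(dy, dx))
    # Wrap the heading error into the interval [-180, 180).
    bearing = (world_angle - pose.yaw_deg + 180.0) % 360.0 - 180.0
    return distance, bearing


def reached(pose: Pose, goal_xy: tuple[float, float], delta: float = 1.0) -> bool:
    """Success test: the agent is within `delta` metres of the goal.

    The default value of `delta` is an assumed placeholder.
    """
    return goal_relative(pose, goal_xy)[0] <= delta
```

In a zero-shot setup, values like these could be written into the text prompt alongside the egocentric image, leaving the model to reconcile the global goal direction with the obstacles it sees locally.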

Paper Structure

This paper contains 33 sections, 6 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Illustration of IndustryNav benchmark. IndustryNav provides a zero-shot navigation setting where an embodied agent is prompted with an egocentric image, global odometry, and action–state history to reach a target while avoiding dynamic obstacles. The task-success results of nine VLLMs show that spatial reasoning in dynamic environments remains challenging, with closed-source models outperforming open-source ones; only Nemotron approaches closed-source performance.
  • Figure 2: Overview of the IndustryNav benchmark. IndustryNav is built on Unity and consists of 12 dynamic warehouse environments. The navigation pipeline combines local egocentric observations with global odometry information, enabling the embodied agent to generate appropriate navigation actions. To comprehensively evaluate navigation performance, we assess three dimensions: task success (Success Ratio and Distance Ratio), trajectory efficiency (Average Steps), and safety behaviors (Collision Ratio and Warning Ratio).
  • Figure 3: Visualization of the camera setup for the IndustryNav agent. An egocentric camera mounted on the agent’s body captures first-person visual observations of the surrounding environment, while a fixed bird’s-eye-view camera positioned above the warehouse continuously tracks the agent’s movement in real time using a red cone marker. The bottom panels show the corresponding egocentric and top-down visual perspectives used for navigation monitoring and trajectory analysis.
  • Figure 4: An illustration of warning detection. A warning is triggered when the minimum depth value within the region of interest (RoI) falls below a predefined threshold.
  • Figure 5: Illustration of both correct (first row) and incorrect (second and third rows) action behaviors of GPT-5-mini under the IndustryNav scenario. The red triangle indicates the agent's current position and direction, and the green circle marks the target location.
  • ...and 12 more figures
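
The warning rule described in the Figure 4 caption (minimum depth inside a region of interest falling below a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the depth map is taken to be a 2D list of distances in metres, the RoI is a hypothetical rectangle given as half-open row and column bounds, and the 1.0 m default reuses the safety margin mentioned in the TL;DR as an assumed threshold.

```python
def warning_triggered(depth, roi, threshold_m=1.0):
    """Return True if the closest point inside the RoI is nearer than the threshold.

    depth:       2D list of rows, each value a distance in metres (assumed format).
    roi:         (top, bottom, left, right) with half-open bounds (hypothetical layout).
    threshold_m: warning distance in metres; 1.0 is an assumed default.
    """
    top, bottom, left, right = roi
    # Collect every depth value that falls inside the rectangular RoI.
    values = [d for row in depth[top:bottom] for d in row[left:right]]
    if not values:
        raise ValueError("RoI selects no pixels")
    return min(values) < threshold_m


def central_roi(height, width, fraction=0.5):
    """Hypothetical helper: a centred rectangle covering `fraction` of each dimension."""
    roi_h = max(1, int(height * fraction))
    roi_w = max(1, int(width * fraction))
    top = (height - roi_h) // 2
    left = (width - roi_w) // 2
    return top, top + roi_h, left, left + roi_w
```

Restricting the check to an RoI means that a nearby object at the edge of the frame, outside the chosen rectangle, does not raise a warning; only obstacles inside the region count.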