OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

Zhongyuang Liu, Min He, Shaonan Yu, Xinhang Xu, Muqing Cao, Jianping Li, Jianfei Yang, Lihua Xie

Abstract

Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.

Paper Structure (27 sections, 10 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: OmniVLN, a zero-shot visual-language navigation (VLN) framework coupling $360^\circ$ 3D perception with token-efficient reasoning across aerial and ground platforms.
  • Figure 2: Overview of the proposed framework. The Multimodal Perception module (left) achieves $360^\circ$ spatio-temporal consistency by fusing data from a rotating LiDAR and panoramic fisheye cameras, enabling the generation of high-fidelity semantic point clouds across robotic platforms. The Hierarchical Representation module (center) incrementally constructs an online five-layer DSG, bridging the gap between low-level geometric places and high-level macro-spatial rooms, while employing a VLM-based hybrid pruning mechanism to refine physical veracity. The LLM-based Reasoning module (right) transforms global graph knowledge into an agent-centric 3D octant observation model, utilizing a closed-loop actor--critic framework and DSG-guided hierarchical prompting to translate natural language queries into executable navigation actions.
  • Figure 3: Omnidirectional projection model. A 3D point $P_t=(x,y,z)$ in the robot's egocentric frame is mapped onto a unit sphere and projected to pixel coordinates $(u,v)$ of a $360^\circ$ panoramic image using the equirectangular projection model.
  • Figure 4: Visualization of the multi-modal perception in the IoT laboratory. The top row shows panoramic RGB observations from the agent's perspective. The bottom row displays the corresponding 3D semantic point clouds with instantiated objects and their unique IDs, which serve as the raw input for the DSG hierarchy construction.
  • Figure 5: Multi-resolution spatial attention mechanism for token-efficient navigation. The central red diamond denotes the ego-agent. Semantic object categories are color-coded as follows: chair (blue), table (green), cabinet (orange).
  • ...and 4 more figures
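
The projection described in the Figure 3 caption (a 3D point normalized onto the unit sphere, then mapped to equirectangular pixel coordinates) can be illustrated with a minimal sketch. It is not the authors' implementation. The axis convention (x forward, y left, z up), the function name `project_equirectangular`, and the image size used in the example are assumptions made for illustration only; the paper's own frame conventions may differ.

```python
import math


def project_equirectangular(point, width, height):
    """Map an egocentric 3D point to continuous equirectangular pixel coords.

    Assumed (hypothetical) convention: x forward, y left, z up.
    The forward direction lands at the image centre; u grows to the right,
    v grows downward.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the origin has no defined viewing direction")

    # Direction on the unit sphere, expressed as two angles.
    azimuth = math.atan2(y, x)                          # (-pi, pi], positive to the left
    elevation = math.asin(max(-1.0, min(1.0, z / r)))   # [-pi/2, pi/2], positive up

    # Equirectangular mapping: angles are linear in pixel coordinates.
    u = (0.5 - azimuth / (2.0 * math.pi)) * width
    v = (0.5 - elevation / math.pi) * height

    # Wrap u so the seam directly behind the agent maps to column 0.
    return u % width, v


# A point straight ahead projects to the image centre.
centre = project_equirectangular((1.0, 0.0, 0.0), 2048, 1024)  # (1024.0, 512.0)
```

Because both angles are linear in pixel coordinates, a single panorama covers the full $360^\circ$ of azimuth, so a point behind the agent is still assigned a pixel without any rotation of the sensor.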