Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters

Yuxiang Zhao; Yirong Yang; Yanqing Zhu; Yanfen Shen; Chiyu Wang; Zhining Gu; Pei Shi; Wei Guo; Mu Xu

TL;DR

The paper defines a new task, out-to-in prior-free instruction-driven embodied navigation, and presents BridgeNav, a vision-centric framework that combines latent intention inference with optical-flow-guided dynamic perception to navigate from outdoors to indoors using only egocentric visual observations and lightweight instructions. It also introduces BridgeNavDataset, a large-scale open-source dataset generated via trajectory-conditioned video synthesis to support training and evaluation of outdoor-to-indoor transitions. In the reported experiments, BridgeNav outperforms recent baselines in success rate and navigation efficiency, and ablations indicate that each key component contributes. The work targets last-meter navigation: entering a specific building entrance without accurate external priors, maps, or extensive textual guidance.

Abstract

Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
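The abstract names success rate and path efficiency as the key metrics, but this page does not define them. The following is a minimal sketch assuming the definitions commonly used in embodied navigation: success rate as the fraction of successful episodes, and SPL (Success weighted by Path Length) as one possible path-efficiency measure. The function names and the episode tuple layout are hypothetical, and the paper's own metrics may differ.

```python
from typing import List, Tuple

# Hypothetical episode record: (success, shortest_path_length, agent_path_length).
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length (assumed efficiency metric).

    A successful episode contributes shortest / max(shortest, taken);
    a failed episode contributes 0.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        longest = max(shortest, taken)
        if success and longest > 0:
            total += shortest / longest
    return total / len(episodes)


# Illustrative values only, not results from the paper.
episodes = [
    (True, 10.0, 10.0),  # optimal path
    (True, 8.0, 16.0),   # success, but twice the shortest path
    (False, 5.0, 7.0),   # failure contributes nothing to SPL
    (True, 4.0, 5.0),
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Under this definition an agent that succeeds by a long detour scores lower than one that takes the shortest route, which is why efficiency is reported alongside raw success.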

Paper Structure (32 sections, 5 equations, 9 figures, 3 tables)

Introduction
Related Work
Real-World Navigation.
BridgeNav
Problem Formulation
Overview
Latent Intention Inference
Optical Flow-Guided Dynamic Perception
Training Strategy
BridgeNavDataset
Raw Streetview Data.
Trajectory and Instruction Annotation.
Trajectory Guided Video Generation.
Target Anchor Refinement.
Experiments
...and 17 more sections

Figures (9)

Figure 1: Existing embodied navigation typically focuses exclusively on either indoor or outdoor scenes. However, embodied agents performing delivery tasks often need to transition seamlessly between these two environments. To bridge this gap, we propose a novel task BridgeNav that enables agents to navigate from outdoor to indoor and accurately enter buildings without relying on any additional priors.
Figure 2: Method overview. Our proposed framework consists of four main components. (1) A multimodal large language model for vision–language understanding. (2) Cross-attention modules that enable interaction between initial learnable tokens (used for trajectory prediction) and the multimodal content representations. (3) A latent intention inference module that identifies salient regions in the current observation to guide attention. (4) An optical flow–guided dynamic perception module that establishes a mapping between the agent's navigation trajectory and salient changes in future visual observations.
Figure 3: Dataset construction overview. Our proposed data generation pipeline consists of three main components. (1) Target location acquisition and occupancy map construction. (2) Trajectory and instruction annotation. (3) Video synthesis under constraints from the initial frame and annotated trajectories.
Figure 4: Qualitative results from real-world deployment. Best viewed in color and zoomed in for more details.
Figure 5: Visualization of the latent intention inference module. Best viewed in color and zoomed in for more details.
...and 4 more figures
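The Figure 2 caption describes cross-attention between learnable tokens (used for trajectory prediction) and the multimodal content representations. As a rough illustration, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It assumes the learnable tokens act as queries and the multimodal features serve as both keys and values, and it omits the learned projection matrices, multiple heads, and everything else specific to the paper. The names `cross_attention`, `tokens`, and `features` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Each query attends over all keys and returns a weighted sum of values."""
    dim = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output row per query.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        )
    return outputs


# Illustrative inputs: two tokens (queries) and three feature vectors.
tokens = [[0.0, 0.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(tokens, features, features)
```

The output has one row per token, so a fixed number of tokens yields a fixed-size representation regardless of how many visual or text features are attended over.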
