YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

Ryan Meegan, Adam D'Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana

TL;DR

YOPO-Nav addresses visual navigation without full 3D maps by building a graph of local 3D Gaussian Splatting nodes from one-pass videos. It fuses coarse localization via Visual Place Recognition with fine-grained pose estimation inside local 3DGS nodes to generate navigation actions, demonstrated on the YOPO-Campus dataset collected with a real Jackal robot. The approach outperforms zero-shot baselines on image-goal navigation, showing robustness to appearance changes and scalability through a lightweight, interpretable scene representation. By combining topological structure with fast local geometry, YOPO-Nav enables reliable, real-world robotic navigation from limited data with optional human interventions.

Abstract

Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.

Paper Structure

This paper contains 27 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: YOPO-Campus Dataset. Bird's-eye view (BEV) of the paths traversed by the Jackal robot under human teleoperation across Rutgers University, Busch Campus. Egocentric video from the robot and action control sequences are captured over 4 hours across 6 km.
  • Figure 2: YOPO-Nav: Videos of human-teleoperated robot trajectories are used to construct a scene representation as a graph of local 3DGS nodes. When the embodied agent revisits the scene, coarse localization is performed using Visual Place Recognition (VPR), linking frames in the recorded trajectory to a corresponding node in the 3DGS graph. The robot’s real-world pose, $p'$, is localized in the 3DGS node via PnP RANSAC, and the difference from the desired pose, $p$, yields a transformation matrix that directs actions to align with $p$ (see Section \ref{subsec:yoponav}).
  • Figure 3: YOPO-Nav Scene Representation. YOPO-Nav represents a scene as a graph of 3DGS models, built from $\sim$50–55 frames at $448 \times 336$ resolution, using videos of human-teleoperated robot trajectories. Edges connect nodes by frame continuity (within each video) or by visual similarity (across different videos). Navigation proceeds by localizing new camera observations in the 3DGS and aligning them to the estimated poses from the videos.
  • Figure 4: YOPO-Campus Dataset Viewer. GUI for efficient visualization of YOPO-Campus: (left) bird's-eye view annotated with routers, planned path, and robot position based on the current frame; (right) RGB/depth images; (center) frame data (timestamp, action, FTM/RSSI, compass, GPS) with a player to view the data associated with each frame in the video.
  • Figure 5: YOPO-Nav GUI. The YOPO-Nav GUI comprises five widgets: (1) the top-left displays the Jackal robot's live camera feed; (2) the bottom-left shows the camera feed's closest matched frame in the FAISS \cite{douze2025faiss} index; (3) the center presents a BEV of the campus with the robot's position, desired goal, and planned path; (4) the top-right displays the start and goal images, the next frame in the planned path, and performance metrics (actions, time, interventions); and (5) the bottom-right renders the 3DGS model and simulated actions in real time.
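
The Figure 2 caption says that the difference between the localized pose $p'$ and the desired pose $p$ yields a transformation that directs the robot's actions. The sketch below illustrates that idea in a simplified planar (SE(2)) setting. It is not the authors' implementation: the `Pose2D` type, the frame convention (x forward, y left, counterclockwise-positive heading), the action names, and the tolerances are all assumptions made for illustration. The PnP RANSAC localization step is taken as given and is not shown.

```python
import math
from typing import NamedTuple


class Pose2D(NamedTuple):
    """Planar pose (hypothetical type): position in metres, heading in radians."""
    x: float
    y: float
    theta: float


def wrap_angle(a: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def relative_pose(current: Pose2D, desired: Pose2D) -> Pose2D:
    """Express `desired` in the frame of `current`: inv(T_current) * T_desired."""
    dx = desired.x - current.x
    dy = desired.y - current.y
    c, s = math.cos(current.theta), math.sin(current.theta)
    # Rotate the world-frame offset by -current.theta into the robot frame.
    return Pose2D(c * dx + s * dy,
                  -s * dx + c * dy,
                  wrap_angle(desired.theta - current.theta))


def choose_action(rel: Pose2D,
                  pos_tol: float = 0.2,
                  ang_tol: float = math.radians(10.0)) -> str:
    """Map a relative pose to a discrete action (assumed action set and tolerances)."""
    if math.hypot(rel.x, rel.y) <= pos_tol:
        # Already at the target position: only the heading may need correcting.
        if abs(rel.theta) <= ang_tol:
            return "stop"
        return "turn_left" if rel.theta > 0 else "turn_right"
    # Otherwise face the target before driving toward it.
    bearing = math.atan2(rel.y, rel.x)
    if abs(bearing) > ang_tol:
        return "turn_left" if bearing > 0 else "turn_right"
    return "forward"


if __name__ == "__main__":
    p_prime = Pose2D(0.0, 0.0, 0.0)  # localized pose p' (assumed values)
    p = Pose2D(0.0, 1.0, 0.0)        # desired pose p (assumed values)
    print(choose_action(relative_pose(p_prime, p)))
```

In this sketch the robot first turns until the target lies within the angular tolerance, then drives forward, and finally corrects its heading once it is within the position tolerance. A full system would work with 6-DoF poses and would need to handle localization failures, which is where the optional human interventions mentioned in the TL;DR could come in.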