Learning to Localize Reference Trajectories in Image-Space for Visual Navigation

Finn Lukas Busch; Matti Vahs; Quantao Yang; Jesús Gerardo Ortega Peimbert; Yixi Cai; Jana Tumova; Olov Andersson

Abstract

We present LoTIS, a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view, without requiring camera calibration, poses, or robot-specific training. Instead of predicting actions tied to specific robots, we predict the image-space coordinates of the reference trajectory as they would appear in the robot's current view. This creates robot-agnostic visual guidance that easily integrates with local planning. Consequently, our model's predictions provide guidance zero-shot across diverse embodiments. By decoupling perception from action and learning to localize trajectory points rather than imitate behavioral priors, we enable a cross-trajectory training strategy for robustness to viewpoint and camera changes. We outperform state-of-the-art methods by 20-50 percentage points in success rate on conventional forward navigation, achieving 94-98% success rate across diverse sim and real environments. Furthermore, we achieve over 5x improvements on challenging tasks where baselines fail, such as backward traversal. The system is straightforward to use: we show how even a video from a phone camera directly enables different robots to navigate to any point on the trajectory. Videos, demo, and code are available at https://finnbusch.com/lotis.

Paper Structure (55 sections, 14 equations, 10 figures, 5 tables)

Introduction
Related Work
Visual Navigation
Learned Visual Geometry
Problem Statement
Method
Guidance Representation
Model Architecture
Trajectory Encoder $\mathcal{E}_\mathrm{T}$
Query Encoder $\mathcal{E}_\mathrm{q}$
Query-Trajectory Fusion $\mathcal{F}_\mathrm{qT}$
Prediction Head $\mathcal{P}$
Implementation
Training
Cross-Trajectory Training
...and 40 more sections

Figures (10)

Figure 1: Given only a reference trajectory of (unposed) RGB images $\mathcal{T}$, our model localizes the trajectory within the robot's current view. The predicted image-space coordinates, distances and visibility of the reference trajectory poses $(\mathbf{p}_i, d_i, v_i)$ provide robot-agnostic guidance for local planning, enabling different robots to go to any point on the trajectory, from any view of the trajectory.
Figure 2: LoTIS Architecture. Reference trajectory $\mathcal{T}$ and query $I_q$ are processed by frozen DINOv3 backbones. A trajectory encoder ($\mathcal{E}_T$) captures spatio-temporal context once (offline), while a query encoder ($\mathcal{E}_q$) and query-trajectory fusion ($\mathcal{F}_{qT}$) perform online feature extraction and fusion, respectively. Finally, a recurrent transformer iteratively regresses image-space coordinates ($\mathbf{p}_i$), visibility ($v_i$), and distances ($d_i$).
Figure 3: Relative success rate (SR) for all methods on off-trajectory initialization over initialization distance (left), compared to subgoal localization accuracy for baseline methods (right). LoTIS does not perform discrete subgoal selection and is therefore omitted from the right panel.
Figure 4: Four trajectories used for real-world evaluation. The start and end of each reference trajectory are marked, as are the initial experiment positions. We present an offline-computed reconstruction of the environments (MASt3R-SLAM [murai2024_mast3rslam]) alongside representative views from three sources, one of which uses an analog FPV camera indoors and a RealSense D455 outdoors. Our model's trajectory predictions for the on-board cameras are overlaid in the corresponding views.
Figure 5: Impact of environment changes on the predictions of our model with respect to a reference trajectory recorded on a sunny autumn day. Top Right: Seasonal Change, Bottom Left: Seasonal and day-night change, Bottom Right: Seasonal, day-night change and people occluding the view.
...and 5 more figures
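
The per-pose outputs $(\mathbf{p}_i, d_i, v_i)$ described in the Figure 1 caption can feed a local planner in many ways. The following is a minimal sketch of one possibility, not the authors' planner: it picks the visible reference pose closest in index to a desired goal and converts its horizontal pixel position into a normalized steering offset. The names `TrajectoryPointPrediction`, `select_guidance_point`, and `steering_offset`, the visibility threshold of 0.5, and the assumption that visibility lies in $[0, 1]$ are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrajectoryPointPrediction:
    """Hypothetical container for one localized reference pose (p_i, d_i, v_i)."""
    x: float           # horizontal pixel coordinate of p_i
    y: float           # vertical pixel coordinate of p_i
    distance: float    # predicted distance d_i (units are an assumption)
    visibility: float  # predicted visibility v_i, assumed to lie in [0, 1]


def select_guidance_point(
    preds: Sequence[TrajectoryPointPrediction],
    goal_index: int,
    visibility_threshold: float = 0.5,  # assumed cutoff, not from the paper
) -> Optional[int]:
    """Return the index of the visible pose closest in index to the goal.

    Returns None when no pose passes the visibility threshold.
    """
    candidates = [
        i for i, p in enumerate(preds) if p.visibility >= visibility_threshold
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda i: abs(i - goal_index))


def steering_offset(pred: TrajectoryPointPrediction, image_width: int) -> float:
    """Normalized horizontal offset in [-1, 1]; negative means left of center."""
    half = image_width / 2.0
    return max(-1.0, min(1.0, (pred.x - half) / half))


# Illustrative values only; these are not results from the paper.
example_preds = [
    TrajectoryPointPrediction(x=320.0, y=400.0, distance=1.0, visibility=0.9),
    TrajectoryPointPrediction(x=400.0, y=350.0, distance=2.0, visibility=0.8),
    TrajectoryPointPrediction(x=480.0, y=300.0, distance=3.0, visibility=0.2),
]
target = select_guidance_point(example_preds, goal_index=2)
if target is not None:
    offset = steering_offset(example_preds[target], image_width=640)
```

Because the sketch works purely on pixel coordinates and a visibility score, it needs no camera calibration or robot model, in keeping with the robot-agnostic guidance the abstract describes. A real controller would map the offset to embodiment-specific commands.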
