Learning to Localize Reference Trajectories in Image-Space for Visual Navigation

Finn Lukas Busch, Matti Vahs, Quantao Yang, Jesús Gerardo Ortega Peimbert, Yixi Cai, Jana Tumova, Olov Andersson

Abstract

We present LoTIS, a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view, without requiring camera calibration, poses, or robot-specific training. Instead of predicting actions tied to specific robots, we predict the image-space coordinates of the reference trajectory as they would appear in the robot's current view. This creates robot-agnostic visual guidance that easily integrates with local planning. Consequently, our model's predictions provide guidance zero-shot across diverse embodiments. By decoupling perception from action and learning to localize trajectory points rather than imitate behavioral priors, we enable a cross-trajectory training strategy for robustness to viewpoint and camera changes. We outperform state-of-the-art methods by 20-50 percentage points in success rate on conventional forward navigation, achieving 94-98% success rate across diverse sim and real environments. Furthermore, we achieve over 5x improvements on challenging tasks where baselines fail, such as backward traversal. The system is straightforward to use: we show how even a video from a phone camera directly enables different robots to navigate to any point on the trajectory. Videos, demo, and code are available at https://finnbusch.com/lotis.
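
As a rough illustration of how per-point predictions of image-space coordinates, distance, and visibility could feed a local planner, the Python sketch below picks the visible trajectory point closest in index to a chosen goal and converts its horizontal position into a normalized steering offset. This is not the authors' planner: `PointPrediction`, `select_target`, `steering_offset`, the 0.5 visibility threshold, and the pixel and meter units are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PointPrediction:
    """Hypothetical container for one localized reference-trajectory point."""
    x: float           # image-space horizontal coordinate (pixels, assumed)
    y: float           # image-space vertical coordinate (pixels, assumed)
    distance: float    # predicted distance to the point (meters, assumed)
    visibility: float  # predicted visibility score in [0, 1] (assumed range)


def select_target(preds: List[PointPrediction], goal_index: int,
                  vis_threshold: float = 0.5) -> Optional[int]:
    """Return the index of the visible point closest in index to the goal.

    Works for goals ahead of or behind the currently visible part of the
    trajectory. Returns None when no point passes the visibility threshold.
    """
    visible = [i for i, p in enumerate(preds) if p.visibility >= vis_threshold]
    if not visible:
        return None
    return min(visible, key=lambda i: abs(i - goal_index))


def steering_offset(pred: PointPrediction, image_width: int) -> float:
    """Horizontal offset of the target from the image center, clamped to [-1, 1]."""
    half = image_width / 2.0
    return max(-1.0, min(1.0, (pred.x - half) / half))


if __name__ == "__main__":
    # Made-up predictions for three trajectory points in a 640-pixel-wide view.
    preds = [
        PointPrediction(x=100.0, y=200.0, distance=1.0, visibility=0.9),
        PointPrediction(x=480.0, y=200.0, distance=2.0, visibility=0.8),
        PointPrediction(x=600.0, y=180.0, distance=3.0, visibility=0.1),
    ]
    idx = select_target(preds, goal_index=2)
    if idx is not None:
        print(idx, steering_offset(preds[idx], image_width=640))
```

Because the target is expressed only in image coordinates, a sketch like this does not depend on a particular robot's action space; a robot-specific controller would map the offset and predicted distance to its own commands.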

Paper Structure

This paper contains 55 sections, 14 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Given only a reference trajectory of (unposed) RGB images $\mathcal{T}$, our model localizes the trajectory within the robot's current view. The predicted image-space coordinates, distances and visibility of the reference trajectory poses $(\mathbf{p}_i, d_i, v_i)$ provide robot-agnostic guidance for local planning, enabling different robots to go to any point on the trajectory, from any view of the trajectory.
  • Figure 2: LoTIS Architecture. Reference trajectory $\mathcal{T}$ and query $I_q$ are processed by frozen DINOv3 backbones. A trajectory encoder ($\mathcal{E}_T$) captures spatio-temporal context once (offline), while a query encoder ($\mathcal{E}_q$) and query-trajectory fusion ($\mathcal{F}_{qT}$) perform online feature extraction and fusion, respectively. Finally, a recurrent transformer iteratively regresses image-space coordinates ($\mathbf{p}_i$), visibility ($v_i$), and distances ($d_i$).
  • Figure 3: Relative success rate (SR) for all methods on off-trajectory initialization over initialization distance (left), compared to subgoal localization accuracy for baseline methods (right). LoTIS does not perform discrete subgoal selection and is therefore omitted from the right panel.
  • Figure 4: Four trajectories used for real-world evaluation. The start and end of each reference trajectory and the initial experiment positions are marked in the figure. We present an offline-computed reconstruction of the environments [murai2024_mast3rslam] alongside representative views from three camera sources, one of which uses an analog FPV camera indoors and a RealSense D455 outdoors. Our model's trajectory predictions for the on-board cameras are overlaid in the corresponding views.
  • Figure 5: Impact of environment changes on the predictions of our model with respect to a reference trajectory recorded on a sunny autumn day. Top right: seasonal change. Bottom left: seasonal and day-night change. Bottom right: seasonal and day-night change, with people occluding the view.
  • ...and 5 more figures