OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Phuc D. A. Nguyen, Anh N. Nhu, Ming C. Lin

TL;DR

OpenVO estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust construction of trajectory datasets from rare driving events captured by dashcams.

Abstract

We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust construction of trajectory datasets from rare driving events captured by dashcams. Existing VO methods are trained at a fixed observation frequency (e.g., 10 Hz or 12 Hz) and completely overlook temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when they are (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These limitations significantly restrict their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks (KITTI, nuScenes, and Argoverse 2), achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
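To make the notion of "varying observation rates" concrete, the following minimal sketch shows how keeping every k-th frame of a clip changes the time gap between the two frames fed to a two-frame pose regressor. The helper name `subsample_pairs`, the 12 Hz native rate, and the strides are illustrative assumptions, not part of OpenVO.

```python
from typing import List, Tuple


def subsample_pairs(timestamps: List[float], stride: int) -> List[Tuple[int, int, float]]:
    """Keep every `stride`-th frame and return (i, j, dt) for consecutive kept frames.

    Hypothetical helper for illustration only: `dt` is the time gap in seconds
    that a time-aware model could be conditioned on.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept = list(range(0, len(timestamps), stride))
    return [(i, j, timestamps[j] - timestamps[i]) for i, j in zip(kept, kept[1:])]


# Assumed example: a one-second clip recorded at a native rate of 12 Hz.
native_hz = 12.0
timestamps = [k / native_hz for k in range(13)]

for stride in (1, 2, 3):
    pairs = subsample_pairs(timestamps, stride)
    dt = pairs[0][2]
    # The effective observation rate is the inverse of the frame gap.
    print(f"stride={stride}: {len(pairs)} pairs, dt={dt:.4f} s, rate={1.0 / dt:.1f} Hz")
```

The same physical motion yields a larger apparent displacement between frames as the gap grows, which is why a model trained at a single fixed rate may misjudge scale when the rate changes.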

Paper Structure

This paper contains 15 sections, 18 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Left: Generalized Visual Odometry provides real-world ego-motion and trajectory estimates that bridge perception and control in autonomous driving. It enables scene understanding (Driving VQA [qian2024nuscenesqa, wei2025driveqa]), simulation (Real2Sim [liu2023vectormapnet, shi2024globalmapnet]), action grounding (Driving VLA [hwang2024emma, li2025drivevla]), and precise motion feedback for low-level control [kim2020advisable, chen2022learning, xu2025drivegpt4]. Right: We introduce OpenVO, a generalizable visual odometry framework that estimates real-world ego-motion from uncalibrated dashcam videos and remains robust across varying observation rates. Our design integrates a "Time-Aware Flow Encoder" for modeling temporal dynamics and a "Geometry-Aware Context Encoder" for extracting consistent scene geometry, enabling robust and generalizable motion estimation across diverse visual and temporal domains.
  • Figure 2: Overview of OpenVO. We propose a novel temporal-dynamics-informed, geometry-aware visual odometry system. Our method takes consecutive dashcam frames as input and extracts both temporal and geometric representations for robust ego-motion estimation. The Time-Aware Flow Encoder (Sec. \ref{sec:timeawareenc}) leverages a Differentiable 2D-Guided 3D Flow module and time-conditioned embeddings to model motion dynamics across varying observation rates, while the Geometry-Aware Context Encoder (Sec. \ref{sec:geoawareenc}) incorporates metric depth and intrinsic priors to build a consistent 3D geometric structure of the scene. Finally, the World-Coordinate Egomotion Decoder (Sec. \ref{sec:worldcoordtraining}) predicts accurate world-coordinate ego-motion trajectories from the fused dynamic-geometric representation.
  • Figure 3: Qualitative results. We present trajectory prediction results on the KITTI and nuScenes datasets. Compared to ZeroVO$^{\ddagger}$, both variants of our method, which use the differentiable (OpenVO-diff) and non-differentiable (OpenVO-nodiff) versions of our 2D-guided 3D flow, achieve higher trajectory prediction accuracy and consistency, surpassing the current state of the art.
  • Figure 4: Modified VectorMapNet [liu2023vectormapnet]. A front-view input image is first processed by an image encoder to extract semantic and geometric features. These features are then lifted into a bird’s-eye-view (BEV) representation using inverse perspective mapping, which leverages the camera’s intrinsic and extrinsic parameters from OpenVO to geometrically project image features onto the ground plane. The resulting BEV feature map is fed into the Vector Map Decoder, which predicts structured map elements in an intermediate representation consisting of key components such as polyline classes, control points, and geometric attributes. Finally, a polyline generator converts these decoded components into continuous vectorized map elements, such as lane boundaries, road dividers, and crosswalks, yielding a high-resolution, topologically meaningful HD map suitable for downstream driving tasks.
  • Figure 5: Qualitative results of global HD-map reconstruction produced by OpenVO + modified monocular VectorMapNet [liu2023vectormapnet]. Local mapping outputs are gradually fused through OpenVO’s ego-to-world pose estimates, producing a coherent global HD-map reconstruction of the full scenario. We refer the reader to our supplementary videos for further details of the OpenVO-enabled monocular global map reconstruction.
  • ...and 4 more figures
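
The fusion step described in the Figure 5 caption, where local map outputs are placed into a global frame through ego-to-world poses, can be illustrated with a minimal planar sketch. It assumes 2D (SE(2)) poses, that each relative pose maps coordinates of frame k+1 into frame k, and that the first frame defines the world origin. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def se2(x: float, y: float, yaw: float) -> np.ndarray:
    """Build a 3x3 homogeneous planar transform from a translation and a heading."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])


def accumulate_poses(relative_poses):
    """Chain frame-to-frame motions into ego-to-world poses (frame 0 is the world origin)."""
    world = [np.eye(3)]
    for t_rel in relative_poses:
        world.append(world[-1] @ t_rel)
    return world


def fuse_local_maps(local_points, ego_to_world):
    """Move per-frame local points (N_i x 2, ego frame) into the world frame and stack them."""
    fused = []
    for pts, pose in zip(local_points, ego_to_world):
        homo = np.hstack([pts, np.ones((len(pts), 1))])
        fused.append((pose @ homo.T).T[:, :2])
    return np.vstack(fused)


# Assumed toy motion: drive 1 m forward, then 1 m forward while turning 90 degrees left.
poses = accumulate_poses([se2(1.0, 0.0, 0.0), se2(1.0, 0.0, np.pi / 2)])

# Each frame observes one map point 1 m straight ahead of the vehicle.
local = [np.array([[1.0, 0.0]]) for _ in poses]
global_map = fuse_local_maps(local, poses)
print(np.round(global_map, 3))
```

Because poses are chained by multiplication, any error in an early relative pose propagates to every later frame, so the quality of the fused map depends directly on the accuracy of the odometry.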