VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang, Zeel Bhatt, Yezhou Yang

TL;DR

VOCAL tackles the interpretability gap in learning-based visual odometry by recasting VO as a label-ranking problem and grounding it in a Bayesian-inspired latent space. It replaces geometric graphs with a Plackett-Luce–based ranking of observations by camera state, trained with a supervised Rank-N-Contrast loss to produce continuous, interpretable features that align with motion. The architecture consists of a Contrastive Feature Encoder and a Pose Estimation Decoder that regresses 6-DoF pose from two-frame optical flow, achieving competitive KITTI results while enabling seamless multimodal integration. The latent representations exhibit a meaningful gradient with motion cues, supporting data-efficient generalization and providing a bridge to other learning-based systems for spatial intelligence.
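
To make the ranking objective concrete, the following is a minimal sketch of a generic Rank-N-Contrast style loss in plain Python. It is not the authors' implementation of $L_{SupRNC}$: the function name `rank_n_contrast_loss`, the `temperature` parameter, the use of negative L2 distance as the similarity, and L2 distance between labels are all assumptions made for illustration. For each anchor `i` and each other sample `j`, the sample `j` is contrasted against every sample whose label is at least as far from the anchor's label as `j`'s is.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_n_contrast_loss(features, labels, temperature=2.0):
    """Generic Rank-N-Contrast style loss (illustrative sketch, assumed form).

    features: list of feature vectors, one per sample.
    labels:   list of label vectors (e.g. camera motion), one per sample.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            d_ij = l2(labels[i], labels[j])
            # Samples whose labels are at least as far from anchor i as j's label.
            # j itself always belongs to this set, so each term is non-negative.
            logits = [
                -l2(features[i], features[k]) / temperature
                for k in range(n)
                if k != i and l2(labels[i], labels[k]) >= d_ij
            ]
            pos = -l2(features[i], features[j]) / temperature
            # Numerically stable log-sum-exp for the denominator.
            m = max(logits)
            log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
            total += -(pos - log_denom)
    return total / (n * (n - 1))


# Hypothetical 1-D labels (e.g. forward motion in meters) and 1-D features.
labels = [[0.0], [1.0], [2.0], [3.0]]
ordered = [[0.0], [1.0], [2.0], [3.0]]    # feature order matches label order
scrambled = [[0.0], [3.0], [1.0], [2.0]]  # feature order violates label order
print(rank_n_contrast_loss(ordered, labels, temperature=1.0))
print(rank_n_contrast_loss(scrambled, labels, temperature=1.0))
```

In this toy setup the loss is lower when the distances between features follow the same ordering as the distances between labels, which is the property that produces a continuous, motion-aligned latent space.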

Abstract

Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL's enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

Paper Structure

This paper contains 17 sections, 14 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: (a) Conventional graph-based visual odometry, where connections between camera states $x_i$ and features $f_j$ are modeled using predefined graphs $z_{ij}$. This manual design limits both flexibility and interpretability in learning-based VO systems. (b) The VOCAL architecture eliminates the need for handcrafted graph structures by reframing VO as a label-ranking problem. Through contrastive learning, VOCAL organizes features extracted from visual inputs based on their corresponding camera states, ensuring that inputs with similar camera states yield consistent features in the latent space. Our approach improves spatial understanding in visual odometry by establishing a direct correlation between feature representations and 3D camera states.
  • Figure 2: High-Level Idea: Panels (a) and (b) show visual inputs from different environments that share the same camera state ("Forward 5 meters"), whereas panels (b) and (c) depict inputs from the same scene but with different camera states ("Forward 5 meters" vs. "Forward 3 meters"). Just as humans can recognize the same motion regardless of environmental differences—and distinguish different motions even in similar scenarios—our approach uses contrastive learning to align features corresponding to similar camera states while separating those corresponding to different states.
  • Figure 3: Gaussian Model vs. Plackett–Luce Model in Learning-based VO: Most learning-based VO methods rely on geometric loss functions derived from a Gaussian assumption, limiting their alignment with the learning process. In contrast, VOCAL adopts the Plackett–Luce model and employs the Supervised Rank-N-Contrast loss ($L_{SupRNC}$) to rank feature representations according to their respective camera states, providing greater flexibility and a clearer interpretation of spatial relationships.
  • Figure 4: System Overview: Our system comprises two main components: a Contrastive Feature Encoder and a Pose Estimation Decoder. The encoder processes optical flow and its augmented variants using a ResNet to generate observation feature vectors. These features are then fed into the Pose Estimation Decoder, which employs Multi-Layer Perceptrons (MLPs) to estimate camera states. During training, the Supervised Rank-N-Contrast loss ($L_{SupRNC}$) ranks the features based on camera states, yielding a spatially meaningful and interpretable latent space that facilitates the estimation of the most likely camera states.
  • Figure 5: Feature Distribution in Latent Space: Lighter features (yellow) correspond to larger camera motions, while darker features (purple) denote smaller motions. The results, based on KITTI sequences 03, 05, 07, and 10, reveal a continuous gradient from lighter to darker features, highlighting the effective ranking of features according to their camera states.
  • ...and 7 more figures
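
The Plackett–Luce model named in Figure 3 assigns a probability to an entire ranking by repeatedly choosing the next item with a softmax over the items not yet placed. The sketch below shows that standard definition in plain Python. The function name `plackett_luce_log_prob` and the example scores are hypothetical, and how VOCAL derives the scores from its features is not specified here, so this illustrates the model only, not the paper's pipeline.

```python
import math


def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (item indices, best first) under Plackett-Luce.

    scores:  list of real-valued scores, one per item (higher means preferred).
    ranking: a permutation of range(len(scores)).
    """
    remaining = list(ranking)
    log_p = 0.0
    for item in ranking:
        # Softmax over the items that have not been placed yet (stable log-sum-exp).
        m = max(scores[k] for k in remaining)
        log_denom = m + math.log(sum(math.exp(scores[k] - m) for k in remaining))
        log_p += scores[item] - log_denom
        remaining.remove(item)
    return log_p


# Hypothetical scores for three observations.
scores = [2.0, 1.0, 0.0]
print(math.exp(plackett_luce_log_prob(scores, [0, 1, 2])))  # ranking sorted by score
print(math.exp(plackett_luce_log_prob(scores, [2, 1, 0])))  # reversed ranking
```

The ranking that sorts items by descending score is the most probable one, and the probabilities of all permutations sum to one, so maximizing this likelihood pushes the scores toward the target ordering.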