
Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation

Friedhelm Hamann, Ziyun Wang, Ioannis Asmanis, Kenneth Chaney, Guillermo Gallego, Kostas Daniilidis

TL;DR

This work addresses dense, long-duration motion estimation from event cameras. It tackles the sim-to-real gap and the lack of dense ground truth with a self-supervised contrast-maximization loss augmented with non-linear motion priors. The model predicts dense per-pixel continuous-time trajectories $\mathbf{q}_n(t)$ via basis expansions and softly associates each event with its $N_{\text{traj}}$ nearest trajectories, using a memory-efficient, differentiable warping pipeline built on a coarse displacement field and a KNN search implemented with KeOps. The loss maximizes the sharpness of the warped events at a randomly chosen reference time $t_{\text{ref}}$, while regularization enforces spatial smoothness and robustness across training references. Empirically, the approach improves zero-shot EVIMO2 performance by about 29% after synthetic pretraining and achieves state-of-the-art self-supervised results on the DSEC optical flow benchmark, with roughly 5x faster inference than baselines. Overall, the method generalizes across architectures and motion priors, reducing the reliance on ground truth while delivering accurate continuous-time motion estimates suitable for robotics and vision tasks.
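To make the idea of a trajectory "basis expansion" concrete, the following is a minimal sketch that evaluates per-pixel trajectories as Bézier curves, i.e. weighted sums of control points with Bernstein polynomials as the basis. The function names, array shapes, and the normalization of time to $[0, 1]$ are assumptions chosen for illustration and are not taken from the authors' code.

```python
import numpy as np
from math import comb


def bernstein_basis(t, degree):
    """Bernstein polynomials B_{k,degree}(t) for k = 0..degree.

    t: array of shape (T,) with normalized times in [0, 1] (assumed).
    Returns an array of shape (T, degree + 1).
    """
    t = np.asarray(t, dtype=float)
    k = np.arange(degree + 1)
    binom = np.array([comb(degree, i) for i in k], dtype=float)
    return binom * t[..., None] ** k * (1.0 - t[..., None]) ** (degree - k)


def evaluate_trajectories(control_points, t):
    """Evaluate N Bezier trajectories at T time samples.

    control_points: (N, degree + 1, 2) hypothetical per-pixel coefficients.
    t: (T,) normalized times in [0, 1].
    Returns positions of shape (N, T, 2).
    """
    degree = control_points.shape[1] - 1
    basis = bernstein_basis(t, degree)  # (T, degree + 1)
    # Weighted sum of control points: q_n(t) = sum_k c_{n,k} * B_k(t)
    return np.einsum("tk,nkd->ntd", basis, control_points)
```

With only two control points per pixel the curve reduces to linear motion, which is why higher-degree curves are one way to express a non-linear motion prior.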

Abstract

Current optical flow and point-tracking methods rely heavily on synthetic datasets. Event cameras are novel vision sensors with advantages in challenging visual conditions, but state-of-the-art frame-based methods cannot be easily adapted to event data due to the limitations of current event simulators. We introduce a novel self-supervised loss combining the Contrast Maximization framework with a non-linear motion prior in the form of pixel-level trajectories and propose an efficient solution to solve the high-dimensional assignment problem between non-linear trajectories and events. Their effectiveness is demonstrated in two scenarios: In dense continuous-time motion estimation, our method improves the zero-shot performance of a synthetically trained model on the real-world dataset EVIMO2 by 29%. In optical flow estimation, our method elevates a simple UNet to achieve state-of-the-art performance among self-supervised methods on the DSEC optical flow benchmark. Our code is available at https://github.com/tub-rip/MotionPriorCMax.
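The assignment between events and trajectories mentioned above can be illustrated with a brute-force soft nearest-neighbour lookup. This is only a sketch under stated assumptions: the softmax weighting over squared distances, the `temperature` parameter, and all function names are hypothetical, and the dense NumPy distance matrix stands in for the memory-efficient KNN search that the paper implements with KeOps. Trajectory positions are assumed to be already evaluated at each event's timestamp.

```python
import numpy as np


def soft_knn_weights(event_xy, traj_xy, n_traj=4, temperature=1.0):
    """Soft-assign each event to its n_traj nearest trajectories.

    event_xy: (E, 2) event pixel coordinates.
    traj_xy:  (N, 2) trajectory positions (assumed evaluated at the event time).
    Returns (idx, w): neighbour indices and weights, both of shape (E, n_traj).
    """
    # Dense squared distances, shape (E, N); fine for a sketch, not for scale.
    d2 = ((event_xy[:, None, :] - traj_xy[None, :, :]) ** 2).sum(-1)
    # Indices of the n_traj smallest distances per event (unordered).
    idx = np.argpartition(d2, n_traj - 1, axis=1)[:, :n_traj]
    d2_k = np.take_along_axis(d2, idx, axis=1)
    # Assumed weighting: softmax over negative squared distance.
    logits = -d2_k / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return idx, w


def soft_displacement(event_xy, traj_xy, traj_disp, n_traj=4, temperature=1.0):
    """Blend the displacements of the nearest trajectories for each event.

    traj_disp: (N, 2) displacement of each trajectory towards the reference time.
    Returns per-event displacements of shape (E, 2).
    """
    idx, w = soft_knn_weights(event_xy, traj_xy, n_traj, temperature)
    return (w[..., None] * traj_disp[idx]).sum(axis=1)
```

Because every step is a smooth function of the trajectory positions and displacements, the same computation written in an autodiff framework would let gradients flow from the warped events back to the predicted trajectories.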

Paper Structure (11 sections, 6 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Summary. a) We present an approach to combine Contrast Maximization with dense non-linear trajectories. b) We show how it can be used for self-supervised learning in a pipeline to predict dense point trajectories, and c) evaluate it on the EVIMO2 dataset, for which we generate dense point tracks. d) Additionally, our approach provides state-of-the-art performance on self-supervised optical flow prediction.
  • Figure 2: Pipeline overview. (a) Input events in a time interval are (b) voxelized and (c) passed to an artificial neural network that predicts per-pixel coefficients for continuous-time trajectories (d). The raw events and predicted trajectories are fed to the loss module (e). Here, a dense spatio-temporal displacement map is interpolated, and events are warped according to their looked-up displacement. Lastly, an image of warped events (IWE) is built at a random reference time and its gradient magnitude acts as training loss. Note that the prediction method displayed here is specific to the Bflow backbone used [Gehrig24pami], which takes additional events before the prediction start time $t_s$ as input.
  • Figure 3: Visualization of predicted trajectories on EVIMO2 data. GT: Ground truth. In-domain: fine-tuned on EVIMO2 using GT (supervised). Zero-shot: network trained only on synthetic data (out-of-domain prediction). Ours: Pre-trained on synthetic data, fine-tuned with self-supervised loss. Note that supervision in-domain is often impossible in practice because dense trajectory labels for real data are difficult to obtain.
  • Figure 4: End-point error vs. prediction time span for three methods: in-distribution, out-of-distribution and self-supervised (Ours). The plot uses the Bézier curve results from Table \ref{tab:exp:evimo_curve_sensit}.
  • Figure 5: Results on DSEC. Image of warped events and predicted flow by three methods.
  • ...and 2 more figures
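
The last step of the pipeline in Figure 2, building an image of warped events (IWE) and scoring it by its gradient magnitude, can be sketched as follows. This is an illustrative NumPy version under stated assumptions: bilinear voting is used to accumulate events, the score is the mean squared gradient, and the function names are hypothetical. The exact objective, its sign convention, and its normalization in the paper may differ.

```python
import numpy as np


def image_of_warped_events(warped_xy, height, width):
    """Accumulate warped events into an image using bilinear voting.

    warped_xy: (E, 2) event coordinates (x, y) after warping to the
    reference time. Events falling outside the image are dropped.
    """
    img = np.zeros((height, width))
    x, y = warped_xy[:, 0], warped_xy[:, 1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Bilinear weight of this neighbouring pixel.
            w = (1.0 - np.abs(x - xi)) * (1.0 - np.abs(y - yi))
            valid = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
            # np.add.at handles repeated indices correctly (unbuffered add).
            np.add.at(img, (yi[valid], xi[valid]), w[valid])
    return img


def gradient_magnitude_score(img):
    """Mean squared gradient magnitude; larger values mean a sharper IWE."""
    gy, gx = np.gradient(img)
    return float(np.mean(gx ** 2 + gy ** 2))
```

Events that are warped along the correct motion pile up on the same pixels and produce strong edges, so a training objective would reward a higher score, for example by minimizing its negative (an assumed convention for this sketch).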