TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim

Abstract

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

Paper Structure (72 sections, 14 equations, 18 figures, 13 tables)

Figures (18)

  • Figure 1: Training data scale of event-based trackers.
  • Figure 1: Layer and head ablation for matching property.
  • Figure 2: Real event vs. synthetic event analysis. (a) Inter-Event Interval (IEI) distributions show that real events concentrate in short intervals with rapid decay, while synthetic events exhibit long tails and periodic artifacts. (b) The appearance of real and synthetic events differs substantially, with synthetic events from V2E [hu2021v2e] and Voltmeter [lin2022dvs] exhibiting artifacts not present in real-world recordings.
  • Figure 2: Visualization of Event Motion Mask $\mathcal{M}_{\text{event}}$.
  • Figure 3: Object motion query sampling. Given teacher-predicted optical flow, we estimate a global affine model via RANSAC and compute residual flow to identify independently moving regions. Queries are oversampled from these regions to prevent bias toward dominant ego-motion patterns.
  • ...and 13 more figures
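
The object motion query sampling summarized in the Figure 3 caption (global affine fit via RANSAC, residual flow, oversampling of independently moving regions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the hand-rolled RANSAC loop, the inlier and residual thresholds, and the `object_ratio` oversampling fraction are all assumptions chosen for illustration; the paper's actual settings are not given in this excerpt.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map: dst ~= [src, 1] @ P, with P of shape (3, 2)."""
    A = np.hstack([src, np.ones((len(src), 1))])
    P, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return P


def ransac_affine(src, dst, iters=200, inlier_thresh=1.0, rng=None):
    """Hypothetical RANSAC loop: fit on 3 random points, keep the largest inlier set."""
    rng = np.random.default_rng(rng)
    A = np.hstack([src, np.ones((len(src), 1))])
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        P = fit_affine(src[idx], dst[idx])
        inliers = np.linalg.norm(A @ P - dst, axis=1) < inlier_thresh
        if inliers.sum() > best.sum():
            best = inliers
    # Refit on the consensus set (fall back to all points if it is too small).
    if best.sum() >= 3:
        return fit_affine(src[best], dst[best])
    return fit_affine(src, dst)


def sample_queries(flow, n_queries=256, object_ratio=0.7,
                   residual_thresh=2.0, seed=0):
    """Sample query points, oversampling pixels whose flow deviates from the
    global affine (ego-motion) model.

    flow: (H, W, 2) array of teacher-predicted (dx, dy) per pixel.
    Returns (queries, moving_mask): queries is (N, 2) in (x, y) order.
    object_ratio and residual_thresh are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    dst = src + flow.reshape(-1, 2)

    # Global affine model approximates the dominant ego-motion.
    P = ransac_affine(src, dst, rng=rng)
    A = np.hstack([src, np.ones((len(src), 1))])

    # Residual flow: what the affine model cannot explain.
    residual = np.linalg.norm(A @ P - dst, axis=1)
    moving = residual > residual_thresh

    obj_idx = np.flatnonzero(moving)
    bg_idx = np.flatnonzero(~moving)
    n_obj = min(int(round(n_queries * object_ratio)), len(obj_idx))
    n_bg = min(n_queries - n_obj, len(bg_idx))

    def pick(idx, n):
        # Random subset without replacement; safe when idx is empty.
        return idx[rng.permutation(len(idx))[:n]]

    chosen = np.concatenate([pick(obj_idx, n_obj), pick(bg_idx, n_bg)])
    return src[chosen], moving.reshape(h, w)
```

Calling `sample_queries(flow)` on an `(H, W, 2)` flow field returns the query coordinates and the boolean mask of pixels flagged as independently moving. Sampling a fixed share of queries from the flagged region, rather than uniformly over the image, is one simple way to keep small moving objects from being drowned out by the background.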