Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
Jingxi Chen, Brandon Y. Feng, Haoming Cai, Tianfu Wang, Levi Burner, Dehao Yuan, Cornelia Fermuller, Christopher A. Metzler, Yiannis Aloimonos
TL;DR
This paper tackles Event-based Video Frame Interpolation (EVFI) by repurposing pre-trained video diffusion foundation models, addressing the scarcity of paired event-frame training data and the weak generalization to unseen real-world sequences that results from it. It introduces RE-VDM, which combines four components: data-efficient adaptation (freezing the base weights while learning a small event-conditioned residual), event-conditioned control via a multi-stack event representation, Per-tile Denoising and Fusion to preserve high-fidelity details, and Two-side Fusion, which enables interpolation by drawing on information from both the start and end frames at each denoising step. The approach generalizes well on real-world datasets, including a new Clear-Motion test suite, and outperforms frame-only video frame interpolation (VFI) methods, EVFI baselines, and test-time optimization baselines in PSNR, SSIM, and LPIPS. It also has practical limitations, notably memory demands and VAE-related detail losses. Overall, the work connects EVFI with generative AI by showing that the large-scale video priors in diffusion models can be adapted to EVFI, enabling robust interpolation and paving the way for cross-domain synthesis in event-driven imaging.
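As a rough illustration of the per-tile idea named above, the sketch below splits a latent into overlapping tiles, applies a denoising function to each tile, and averages the results where tiles overlap. It is only a sketch under stated assumptions: the names `tile_starts`, `denoise_per_tile`, and `denoise_fn`, the tile and stride values, and the uniform averaging are all hypothetical and are not taken from the paper, whose actual fusion rule is not described in this summary.

```python
import numpy as np


def tile_starts(size: int, tile: int, stride: int) -> list[int]:
    """Start offsets of tiles of length `tile` that together cover [0, size)."""
    if tile >= size:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)  # extra tile flush with the far border
    return starts


def denoise_per_tile(latent, denoise_fn, tile=32, stride=24):
    """Apply `denoise_fn` to overlapping tiles and average the overlaps.

    latent:     array of shape (H, W, C).
    denoise_fn: hypothetical stand-in for one denoising step of a diffusion
                model; it must return an array with the same shape as its input.
    """
    if not 0 < stride <= tile:
        raise ValueError("stride must be in (0, tile] so that tiles leave no gaps")
    h, w, _ = latent.shape
    th, tw = min(tile, h), min(tile, w)
    fused = np.zeros_like(latent, dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            patch = latent[y:y + th, x:x + tw]
            fused[y:y + th, x:x + tw] += denoise_fn(patch)
            weight[y:y + th, x:x + tw] += 1.0  # count how many tiles hit each pixel
    # Assumption: plain averaging; a real system might use smoother blend weights.
    return fused / weight


# Usage: an identity "denoiser" is a quick sanity check on the fusion logic.
rng = np.random.default_rng(0)
latent = rng.normal(size=(80, 100, 4))
out = denoise_per_tile(latent, lambda p: p)
print(np.allclose(out, latent))
```

Processing tiles independently keeps each call to the denoiser at a fixed, small resolution, and averaging the overlaps is one simple way to limit visible seams at tile borders.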
Abstract
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.
