Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

Jingxi Chen, Brandon Y. Feng, Haoming Cai, Tianfu Wang, Levi Burner, Dehao Yuan, Cornelia Fermuller, Christopher A. Metzler, Yiannis Aloimonos

TL;DR

This paper tackles Event-based Video Frame Interpolation (EVFI) by repurposing pre-trained video diffusion foundation models, addressing the scarcity of paired event-frame training data and the resulting weak generalization to unseen real-world sequences. It introduces RE-VDM, which combines four components: data-efficient adaptation (freezing the base weights while learning a small event-conditioned residual), event-conditioned control via a multi-stack event representation, Per-tile Denoising and Fusion to preserve high-fidelity details, and Two-side Fusion, which enables frame interpolation by using information from both the start and end frames at each denoising step. The approach generalizes well on real-world datasets, including a new Clear-Motion test suite, and outperforms frame-only VFI methods, EVFI baselines, and test-time optimization baselines in PSNR, SSIM, and LPIPS. It also reveals practical limitations such as memory demands and VAE-related detail loss. Taken together, the work bridges EVFI with generative AI by showing that the large-scale video priors in diffusion models can be adapted to EVFI tasks, enabling robust interpolation and paving the way for cross-domain synthesis in event-driven imaging.

Abstract

Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.

Paper Structure

This paper contains 28 sections, 3 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Overview of our proposed approach, RE-VDM, which adapts pre-trained video diffusion models to two tasks: event-based video generation and interpolation. For video generation, the method uses the start frame $I_{s}$ and forward-time events $E_{s}$. For interpolation, it uses both the start frame $I_{s}$ with forward-time events $E_{s}$ and the end frame $I_{e}$ with backward-time events $E_{e}$ to achieve consistent results. Unlike video generation, in interpolation each denoising step $t$ concludes with Two-side Fusion instead of EVDS.
  • Figure 2: The illustration depicts our training scheme for adapting a pre-trained video diffusion model to event-based video denoising. Our approach uses a frozen denoiser network from the pre-trained model, augmented with a trainable subset of blocks copied from the frozen denoiser.
  • Figure 3: Our multi-stack event representation is illustrated as follows. The stack begins from the target frame at $t_{i}$ and expands backward in time to the previous frame $t_{i-1}$. Within each stack, the number of events accumulated from $t_{i}$ is halved from the previous stack. In the long stack ($m=0$), the slowest-moving objects, such as around the human head, appear sharp; in the middle stack ($m=1$), slower-moving objects, like the human arm, are clear; and in the short stack ($m=2$), the fastest-moving objects, such as the ball, are sharp. This approach ensures that the event data provides adequate control information for generating frame $t_{i}$.
  • Figure 4: The VAE encoding/decoding loss of small details in the original input image. In (a), we show the original image, sized 970 x 625. In (b), we pad the image to the nearest multiple of 8, then encode and decode it back. After decoding, the PSNR drops to 21.50, with noticeable detail loss in the zoomed-in view. In (c), we pad to the nearest multiple of 8, upsample to twice the original width and height, then encode and decode. After decoding, the PSNR increases to around 24.92, with no significant loss of details in the zoomed-in view.
  • Figure 5: The Per-tile Denoising and Fusion process is a test-time optimization applied during inference to enhance the fidelity of video generation appearance and improve event-based motion control accuracy. During each denoising step, our model operates on upsampled tiles of the input image and event representations to predict denoised latents for each tile $i$ at a denoising time $t$ ($Z_{t-1}^{i}$). These denoised tile latents are then accumulated to obtain the predicted denoised latents for the entire video ($\tilde{Z}_{t-1}$) at that denoising step $t$.
  • ...and 13 more figures
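
The multi-stack event representation described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `multi_stack_events`, the signed-polarity accumulation, and the use of three stacks are assumptions made for the example. The only behavior taken from the caption is that every stack ends at the target frame $t_{i}$ and each successive stack keeps half as many of the most recent events as the one before.

```python
import numpy as np

def multi_stack_events(xs, ys, ts, ps, height, width, num_stacks=3):
    """Build `num_stacks` event images for one target frame (hypothetical helper).

    xs, ys: integer pixel coordinates; ts: timestamps; ps: polarities (+1 / -1).
    All events are assumed to fall between the previous frame and the target frame.
    """
    order = np.argsort(ts)  # oldest first, so the events nearest t_i sit at the end
    xs, ys = xs[order], ys[order]
    ps = ps[order].astype(np.float32)
    n = len(ts)
    stacks = np.zeros((num_stacks, height, width), dtype=np.float32)
    for m in range(num_stacks):
        count = n // (2 ** m)  # stack m keeps half as many events as stack m - 1
        if count == 0:
            continue
        recent = slice(n - count, n)  # the `count` events closest to t_i
        # Unbuffered add so repeated pixel hits accumulate instead of overwriting.
        np.add.at(stacks[m], (ys[recent], xs[recent]), ps[recent])
    return stacks
```

With four events arriving left to right along a single row, the long stack ($m=0$) contains all four, the middle stack the latest two, and the short stack only the last one, which is why fast-moving content stays sharp in the shorter stacks.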
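
The accumulation step in the Figure 5 caption (per-tile denoised latents fused into latents for the whole video) can be sketched as below. This is a simplified illustration under stated assumptions: `denoise_tile` is a hypothetical stand-in for the event-conditioned denoiser, the tile size and stride are arbitrary, overlapping regions are fused by plain averaging (the caption says only that tile latents are "accumulated"), and the upsampling of tiles is omitted.

```python
import numpy as np

def tile_starts(size, tile, stride):
    """Start offsets such that tiles of length `tile` cover the range [0, size)."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)  # final tile sits flush with the border
    return starts

def fuse_tiles(latents, denoise_tile, tile=32, stride=24):
    """One denoising step: denoise each spatial tile, then average the overlaps.

    latents: array of shape (frames, channels, H, W).
    denoise_tile: hypothetical callable mapping a tile to a same-shaped tile.
    """
    _, _, h, w = latents.shape
    acc = np.zeros_like(latents, dtype=np.float64)
    weight = np.zeros((1, 1, h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            ys, xs = slice(y, y + tile), slice(x, x + tile)
            acc[..., ys, xs] += denoise_tile(latents[..., ys, xs])
            weight[..., ys, xs] += 1.0  # count how many tiles touched each position
    return acc / weight  # assumption: uniform averaging over overlapping tiles
```

Because every latent position is covered by at least one tile, the division is always well defined. A weighted blend that favors tile centers would be a natural alternative to uniform averaging.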