TimeRewind: Rewinding Time with Image-and-Events Video Diffusion

Jingxi Chen, Brandon Y. Feng, Haoming Cai, Mingyang Xie, Christopher Metzler, Cornelia Fermuller, Yiannis Aloimonos

TL;DR

This work tackles the ill-posed problem of recovering pre-capture motion from a single image by using neuromorphic event cameras to provide motion cues. It introduces TimeRewind, a framework that freezes a pre-trained Img2Vid diffusion model and augments it with an Event Motion Adaptor (EMA) conditioned on event data, so that the synthesized backward-time videos are physically grounded in the recorded events. In experiments on the RGB-Event BS-ERGB dataset, TimeRewind achieves better fidelity and perceptual scores (e.g., PSNR, SSIM, LPIPS) than baselines and RGB-Event backbones, demonstrating robust backward-time video synthesis and improved motion realism. The approach offers practical insights for future consumer cameras and smartphones and opens new research directions at the convergence of event sensing and generative video modeling.

Abstract

This paper addresses the novel challenge of "rewinding" time from a single captured image to recover the fleeting moments missed just before the shutter button is pressed. This problem poses a significant challenge in computer vision and computational photography, as it requires predicting plausible pre-capture motion from a single static frame, an inherently ill-posed task due to the high degree of freedom in potential pixel movements. We overcome this challenge by leveraging the emerging technology of neuromorphic event cameras, which capture motion information with high temporal resolution, and integrating this data with advanced image-to-video diffusion models. Our proposed framework introduces an event motion adaptor conditioned on event camera data, guiding the diffusion model to generate videos that are visually coherent and physically grounded in the captured events. Through extensive experimentation, we demonstrate the capability of our approach to synthesize high-quality videos that effectively "rewind" time, showcasing the potential of combining event camera technology with generative models. Our work opens new avenues for research at the intersection of computer vision, computational photography, and generative modeling, offering a forward-thinking solution to capturing missed moments and enhancing future consumer cameras and smartphones. Please see the project page at https://timerewind.github.io/ for video results and code release.

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: In the everyday use of smartphones for capturing images or videos, the process from opening the camera app to pressing the capture button involves preparation time. This includes aiming at the scene and deciding when to take the shot. Often, the moments we wish to capture occur during this preparation phase. Our work focuses on "rewinding" time from the point of capture to retrieve these missed moments.
  • Figure 2: Img2Vid models are trained for forward-time video synthesis. In contrast, the goal of TimeRewind is to synthesize the video backward into the pre-capture time.
  • Figure 3: Illustration of our proposed TimeRewind approach as an adaptor for general Img2Vid architectures. The components shown in shades of blue (both dark and light) belong to the original pre-trained model and remain unchanged during our training process. The orange components are specific to TimeRewind and are optimized during training. $E$ denotes the VAE encoder that converts the captured input image into latent space, $\tau$ is the diffusion time of a diffusion step, and $f_{\theta}$ is the denoiser network, generally a UNet-like architecture with $N$ up, mid, and down blocks in total. Our EMA module contains $N$ heads of convolution layers, SiLU activations (Hendrycks and Gimpel, 2016), and downsampling layers. It takes the events $e$, the diffusion time $\tau$, and the input latent $z_{\theta}$ as conditions. The objective of the EMA is to accurately predict the residual changes necessary to transfer the motion information from the input events to the input latent $z_{\theta}$. Through a series of iterative diffusion steps, this process integrates the motion information into the input latent, which is then decoded into the final "TimeRewind" videos.
  • Figure 4: Illustration of our event image representation. We show event images accumulated from the before-capture-time events over a 30 ms accumulation window.
  • Figure 5: Comparison of backward-time video synthesis results on sequences where the before-the-capture motion is simple, coming mostly from a few rigid-body objects. Left captures a ball being thrown upwards. Middle depicts a basketball spinning on a finger before being passed from the right hand to the left. Right shows juggling multiple balls between two hands. Note that since our task is backward-time video synthesis, the correct backward-time video sequences, as shown in the reference frames, are the reverse of the processes described above.
  • ...and 1 more figures
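
The event images mentioned in the Figure 4 caption can be illustrated with a minimal sketch. The caption only states that events before the capture time are accumulated over a 30 ms window. Everything else here is an assumption made for illustration and is not taken from the paper: the function name `accumulate_event_image`, the `(t, x, y, p)` event layout with integer microsecond timestamps, and the two-channel per-polarity count image.

```python
import numpy as np


def accumulate_event_image(events, t_capture_us, height, width, window_us=30_000):
    """Accumulate pre-capture events into a 2-channel count image.

    events: array-like of shape (K, 4) with columns (t, x, y, p), where t is a
            timestamp in microseconds and p is the polarity (+1 or -1).
            This layout is a hypothetical choice, not the paper's format.
    Returns an array of shape (2, height, width): channel 0 counts positive
    events, channel 1 counts negative events, both restricted to the window
    [t_capture_us - window_us, t_capture_us).
    """
    events = np.asarray(events, dtype=np.int64).reshape(-1, 4)
    t, x, y, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]

    # Keep only events inside the accumulation window and inside the sensor.
    keep = (t >= t_capture_us - window_us) & (t < t_capture_us)
    keep &= (x >= 0) & (x < width) & (y >= 0) & (y < height)

    image = np.zeros((2, height, width), dtype=np.float32)
    channel = (p[keep] < 0).astype(np.int64)  # 0 for positive, 1 for negative

    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(image, (channel, y[keep], x[keep]), 1.0)
    return image


if __name__ == "__main__":
    demo_events = [
        (990_000, 1, 0, +1),
        (995_000, 1, 0, +1),
        (999_000, 2, 1, -1),
        (900_000, 0, 0, +1),  # older than 30 ms before capture, ignored
    ]
    img = accumulate_event_image(demo_events, t_capture_us=1_000_000, height=2, width=3)
    print(img.shape, img.sum())
```

Counting per polarity is only one common way to rasterize an event stream. Signed sums or time-surface encodings would fit the same caption equally well.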