Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events

Shuoyan Wei, Feng Li, Shengeng Tang, Runmin Cong, Yao Zhao, Meng Wang, Huihui Bai

TL;DR

The paper tackles the challenge of robustly reconstructing high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales (C-STVSR). It introduces EvEnhancer, which fuses event streams with frame data through an Event-Adapted Synthesis Module (EASM) for long-term motion modeling and a Local Implicit Video Transformer (LIVT) for unified continuous video representations. To improve efficiency and generalization, EvEnhancerPlus adds a parameter-free Controllable Switch Mechanism (CSM) and a cross-derivative training strategy (CDTS) to adapt pixel-wise routing to varying reconstruction difficulties. Empirical results on synthetic and real-world datasets show state-of-the-art performance and strong OOD generalization, with EvEnhancerPlus delivering notable efficiency gains while maintaining accuracy. The work advances practical C-STVSR by integrating high-temporal-resolution event data with continuous implicit representations and scalable, adaptive computation.

Abstract

Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.

Paper Structure

This paper contains 30 sections, 16 equations, 12 figures, 14 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of C-STVSR methods: VideoINR [chen2022videoinr], MoTIF [chen2023motif], BF-STVSR [kim2025bf], HR-INR [lu2024hr], and our EvEnhancer and EvEnhancerPlus. (a)-(d) illustrate the different methodologies for implementing INR, highlighted with a gray background. (e) compares the PSNR (dB) of these methods across spatial upsampling scales $s$ and temporal upsampling scales $t$ on GoPro [nah2017deep].
  • Figure 2: The overall backbone of our EvEnhancer and EvEnhancerPlus models, consisting of an event-adapted synthesis module (EASM) and a local implicit video transformer (LIVT).
  • Figure 3: The detailed architecture of the event-adapted synthesis module (EASM), which contains two steps: (a) event-modulated alignment and (b) bidirectional recurrent compensation. "EMB": event modulation block.
  • Figure 4: Structure of the local implicit video transformer (LIVT), which integrates 3D local spatiotemporal attention with an implicit neural function to learn a continuous video INR for reconstructing HR and HFR video frames.
  • Figure 5: The motivation of CSM. (a) is the ground-truth (GT) frame. (b) and (c) are the residual intensity maps that represent the difference between the GT frame and the frames reconstructed by a simple upsampler $\mathcal{U}_0$ and a complex upsampler $\mathcal{U}_1$, respectively. (d) denotes the quality bias between (b) and (c). We use Eq. (\ref{eq:14}) to calculate the reconstruction difficulty of pixels (f) based on the corresponding events (e). Based on Eq. (\ref{eq:11}), we can deploy the distributor $\mathcal{D}(\cdot)$ that derives the distribution map (g) from (f). The result (h) of the Hadamard product "$\bigodot$" between (d) and (g) reflects the reconstruction discrepancy between the remaining regions by EvEnhancerPlus.
  • ...and 7 more figures
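
The pixel-wise routing idea behind the CSM (Figure 5) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the paper's difficulty measure (its Eq. 14) and distributor (its Eq. 11) are not reproduced in this summary, so the box-filtered event density, the fixed `threshold`, and the function names `local_event_density`, `route_pixels`, and `blend` are all assumptions made for illustration. The sketch only shows the general pattern of deriving a binary distribution map from local event statistics and using it to choose, per pixel, between the outputs of a simple and a complex upsampler.

```python
import numpy as np


def local_event_density(events, k=3):
    """Mean absolute event count in a k x k window around each pixel.

    Hypothetical stand-in for the paper's difficulty measure.
    `events` is a 2D array of signed per-pixel event counts.
    """
    pad = k // 2
    padded = np.pad(np.abs(events).astype(np.float64), pad, mode="edge")
    h, w = events.shape
    out = np.zeros((h, w))
    # Sum the k*k shifted views of the padded map (a plain box filter).
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def route_pixels(events, threshold, k=3):
    """Binary distribution map: True marks 'hard' pixels.

    The fixed threshold is an assumption; it involves no learned weights.
    """
    return local_event_density(events, k) > threshold


def blend(simple_out, complex_out, mask):
    """Take the complex pathway's output where mask is True, else the simple one."""
    return np.where(mask, complex_out, simple_out)


# Toy usage: a single burst of events marks a 3 x 3 neighbourhood as hard.
events = np.zeros((6, 6))
events[2, 2] = -9  # polarity is ignored by the density measure
mask = route_pixels(events, threshold=0.5)
frame = blend(np.zeros((6, 6)), np.ones((6, 6)), mask)
```

For clarity the sketch evaluates both pathways everywhere and blends afterwards. An efficiency-oriented version would run the complex pathway only on the pixels selected by the mask, which is where the computational savings described in the abstract would come from.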