HR-INR: Continuous Space-Time Video Super-Resolution via Event Camera

Yunfan Lu, Yusheng Wang, Zipeng Wang, Pengteng Li, Bin Yang, Hui Xiong

TL;DR

HR-INR leverages high-temporal-resolution event data to enable continuous space-time video super-resolution. It introduces an event Temporal Pyramid Representation (TPR) to capture fast regional motion, and combines regional and holistic feature extraction with an INR-based spatiotemporal decoder that uses temporal and spatial embeddings to produce arbitrary-scale outputs. Across four datasets, HR-INR achieves state-of-the-art performance and superior temporal stability compared with both frame-based and prior event-guided methods, while maintaining a compact model size and efficient inference. The approach advances practical video enhancement in dynamic scenes and opens avenues for broader event-driven video processing tasks.

Abstract

Continuous space-time video super-resolution (C-STVSR) aims to simultaneously enhance video resolution and frame rate at an arbitrary scale. Recently, implicit neural representation (INR) has been applied to video restoration, representing videos as implicit fields that can be decoded at an arbitrary scale. However, existing INR-based C-STVSR methods typically rely on only two frames as input, which provides insufficient inter-frame motion information. Consequently, they struggle to capture fast, complex motion and long-term dependencies (spanning more than three frames), which hinders their performance in dynamic scenes. In this paper, we propose a novel INR-based C-STVSR framework, named HR-INR, which captures both holistic dependencies and regional motions. It is assisted by an event camera, a novel sensor renowned for its high temporal resolution and low latency. To fully utilize the rich temporal information from events, we design a feature extraction module consisting of (1) a regional event feature extractor, which takes events as input via the proposed event temporal pyramid representation to capture regional nonlinear motion, and (2) a holistic event-frame feature extractor for long-term dependencies and motion continuity. We then propose a novel INR-based decoder with spatiotemporal embeddings to capture long-term dependencies with a larger temporal perception field. We validate the effectiveness and generalization of our method on four datasets (both simulated and real data), demonstrating its superiority. The project page is available at https://github.com/yunfanLu/HR-INR

Paper Structure (13 sections, 5 equations, 22 figures, 7 tables)

Figures (22)

  • Figure 1: With event data as guidance, our method (HR-INR) takes low-frame-rate, low-resolution videos as input (a) and produces continuous space-time videos at arbitrary frame rates and resolutions (b). The bicycle example in (c) illustrates its modeling of local nonlinear motion: HR-INR recovers the rotation of the bicycle wheels, which the prior method VideoINR [chen2022videoinr] cannot.
  • Figure 2: Overview of our framework. The inputs are multiple frames and their corresponding events. The output is a video with enhanced frame rate and resolution. First, events close to a given time point are transformed into Temporal Pyramid Representations (TPRs) to capture motion at a finer temporal granularity (a). Second, the TPRs, together with the full set of frames and events, are fed into the feature extraction part (b), where the regional event feature extractor and the holistic event-frame feature extractor process their inputs separately. Finally, the resulting features are fused and passed to the INR-based spatiotemporal decoding part (c). Within this part, temporal embedding extracts features at a specific timestamp $t$, followed by spatial embedding with an up-sampling factor $s$ and decoding, which generates frames at the desired resolution.
  • Figure 3: Visualization of the event TPR across different time resolutions. The TPR is divided into 7 layers ($L=7$), with each layer having a time resolution that is $1/3$ of the previous layer ($r = 3$). The time resolution of the first layer ($L_0$) is approximately $2\times \Delta t = 1/9$ s, and the time resolution of the seventh layer ($L_6$) is approximately $1/6561$ s. Each layer visualizes the corresponding event data at its own time resolution, showing how the temporal resolution affects the event representation.
  • Figure 4: Holistic event-frame feature extractor. The down-sample module halves the resolution, and the up-sample module doubles it. The encoder and decoder have the same structure as Swin-Transformer [liu2022video, liu2021swin, geng2022rstt].
  • Figure 5: Temporal embedding. (a) Given the input time $t \in [0,1]$, the output is the temporal attention $E_t$ derived from a two-layer MLP. (b) Visualization of the trained $E_t$ over $t \in [0,1]$ on a real-world dataset [tulyakov2022time].
  • ...and 17 more figures
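
The layer resolutions quoted in the Figure 3 caption follow a geometric progression: each layer's time window is $1/r$ of the previous one. As a minimal sketch, assuming each window is exactly the previous one divided by $r$ (the caption says only "approximately"), the Python snippet below reproduces the two quoted endpoints. The function name `tpr_layer_windows` and its parameters are illustrative and are not taken from the HR-INR code base.

```python
from fractions import Fraction


def tpr_layer_windows(base_window: Fraction, ratio: int, num_layers: int) -> list[Fraction]:
    """Return the time window (in seconds) of each pyramid layer, coarsest first.

    Assumption: layer k covers base_window / ratio**k, i.e. an exact
    geometric progression with common ratio 1/ratio.
    """
    return [base_window / ratio**k for k in range(num_layers)]


# Values quoted in the Figure 3 caption: L = 7 layers, r = 3, first window 1/9 s.
windows = tpr_layer_windows(Fraction(1, 9), ratio=3, num_layers=7)

for k, window in enumerate(windows):
    print(f"L_{k}: {window} s")
```

Under this assumption the first layer spans $1/9$ s and the seventh spans $1/(9 \cdot 3^6) = 1/6561$ s, matching the caption. Exact fractions are used here only to avoid floating-point rounding in the comparison.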