Space-Time Video Super-resolution with Neural Operator

Yuantong Zhang; Hanyou Zheng; Daiqin Yang; Zhenzhong Chen; Haichuan Ma; Wenpeng Ding

TL;DR

This work tackles space-time video super-resolution (ST-VSR) by reframing the problem as neural-operator learning between coarse intra-frame representations and fine-grained inter-frame representations. It introduces STNO, a neural-operator framework that uses a Galerkin-type kernel attention to perform motion estimation and compensation with a global receptive field and linear complexity, enabling efficient handling of large motions. The architecture comprises three stages—input projection, kernel integration, and output projection—with bidirectional temporal propagation and spatial modulation, all without patch-based processing. Empirical results on fixed and continuous ST-VSR tasks show that STNO achieves state-of-the-art performance with faster inference and fewer parameters, validating the effectiveness of the neural-operator approach for complex inter-frame restoration.

Abstract

This paper addresses the task of space-time video super-resolution (ST-VSR). Existing methods generally suffer from inaccurate motion estimation and motion compensation (MEMC) when motions are large. Inspired by recent progress in physics-informed neural networks, we model the challenges of MEMC in ST-VSR as a mapping between two continuous function spaces. Specifically, our approach transforms independent low-resolution representations in the coarse-grained continuous function space into refined representations with enriched spatiotemporal details in the fine-grained continuous function space. To achieve efficient and accurate MEMC, we design a Galerkin-type attention function to perform frame alignment and temporal interpolation. Because the Galerkin-type attention mechanism has linear complexity, our model avoids patch partitioning and offers global receptive fields, enabling precise estimation of large motions. Experimental results show that the proposed method surpasses state-of-the-art techniques in both fixed-size and continuous space-time video super-resolution tasks.
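
The linear-complexity claim can be illustrated with a minimal NumPy sketch of a generic Galerkin-type attention layer as it is usually formulated in the operator-learning literature: no softmax, layer normalization applied to keys and values, and the product K^T V formed before multiplying by Q. This is an assumption-laden illustration, not the STNO module itself; the names `galerkin_attention`, `quadratic_reference`, and `layer_norm`, the single-head layout, and the random projection weights are all hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def galerkin_attention(x, w_q, w_k, w_v):
    # x: (n, d) tokens, one per pixel of the whole frame (no patch partitioning).
    n = x.shape[0]
    q = x @ w_q
    k = layer_norm(x @ w_k)
    v = layer_norm(x @ w_v)
    # Form the (d, d) matrix K^T V first: cost O(n * d^2), linear in n.
    context = (k.T @ v) / n
    # Every output token mixes information from all n input tokens.
    return q @ context

def quadratic_reference(x, w_q, w_k, w_v):
    # Same result, but materializes the (n, n) matrix Q K^T: cost O(n^2 * d).
    n = x.shape[0]
    q = x @ w_q
    k = layer_norm(x @ w_k)
    v = layer_norm(x @ w_v)
    return (q @ k.T) @ v / n

rng = np.random.default_rng(0)
h, w, d = 32, 48, 16                         # hypothetical feature-map size
tokens = rng.standard_normal((h * w, d))     # flattened frame features
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out = galerkin_attention(tokens, w_q, w_k, w_v)
ref = quadratic_reference(tokens, w_q, w_k, w_v)
```

Because there is no softmax, the two orderings of the matrix product agree up to floating-point rounding, so the global receptive field is kept while the cost grows linearly with the number of pixels rather than quadratically.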

Paper Structure

This paper contains 14 sections, 12 equations, 9 figures, and 5 tables.

Introduction
Related Work
Space-Time Video Super-Resolution
Neural Operators
Methodology
Problem Formulation
Network Architecture
Experiments
Experiments Setup
Comparisons to State-of-the-Arts
Computation Complexity Analysis
Visualization of Galerkin-type kernel Function
Ablation Study
Conclusions and Future Work

Figures (9)

Figure 1: Visualization of the Navier-Stokes equation problem. Physics-informed neural networks (PINNs) focus on two main issues: (1) Sequence prediction: given a time series of fluid fields as input, the neural operator aims to predict the fluid changes over the next time interval (e.g., (a) $\rightarrow$ (b)). (2) Zero-shot super-resolution: training on lower-resolution data with coarse-grained discretization and evaluating on higher-resolution data with fine-grained discretization (e.g., training: (a) $\rightarrow$ (b); evaluation: (c) $\rightarrow$ (d)).
Figure 2: Overview of the proposed method. We first extract multi-scale coarse-grained intra representations $F_{\{0,1\}}^{\{1,2,3\}}$, which a kernel-integrated operator subsequently processes to perform multi-scale MEMC. The obtained coarse-grained features $F^{C}$ and motion information are further enhanced through multi-frame information propagation, resulting in fine-grained representations $F^{f}$, which contain rich spatiotemporal information. Finally, the obtained $F^{f}$ are used for upsampling.
Figure 3: The Global Feature Aggregation module comprises two main components: Texture Feature Aggregation and Motion Feature Aggregation. We employ a Galerkin-type attention mechanism to capture global texture features $Te_{t}$ and motion features $Mo_{t}$. The generated $Te_{t}$ and $Mo_{t}$ are coupled together, mutually enhancing each other, ultimately resulting in high-quality interpolated intermediate frame features and motion flow.
Figure 4: Intermediate frames synthesized by different methods for large motion on Adobe. Pay attention to the areas outlined in red boxes, and zoom in for a better view.
Figure 5: Intermediate frames synthesized by different methods for large motion on Adobe. Pay attention to the areas outlined in red boxes, and zoom in for a better view.
...and 4 more figures
