Space-Time Video Super-resolution with Neural Operator
Yuantong Zhang, Hanyou Zheng, Daiqin Yang, Zhenzhong Chen, Haichuan Ma, Wenpeng Ding
TL;DR
This work tackles space-time video super-resolution (ST-VSR) by reframing it as learning a neural operator that maps coarse-grained intra-frame representations to fine-grained inter-frame representations. It introduces STNO, a neural-operator framework that uses Galerkin-type kernel attention to perform motion estimation and compensation with a global receptive field and linear complexity, which lets it handle large motions efficiently. The architecture comprises three stages (input projection, kernel integration, and output projection) and uses bidirectional temporal propagation and spatial modulation, all without patch-based processing. Experiments on fixed-size and continuous ST-VSR tasks show that STNO achieves state-of-the-art performance with faster inference and fewer parameters, supporting the neural-operator approach to complex inter-frame restoration.
Abstract
This paper addresses the task of space-time video super-resolution (ST-VSR). Existing methods generally suffer from inaccurate motion estimation and motion compensation (MEMC) when motions are large. Inspired by recent progress in physics-informed neural networks, we model MEMC in ST-VSR as a mapping between two continuous function spaces. Specifically, our approach transforms independent low-resolution representations in the coarse-grained continuous function space into refined representations with enriched spatiotemporal details in the fine-grained continuous function space. To achieve efficient and accurate MEMC, we design a Galerkin-type attention function to perform frame alignment and temporal interpolation. Because the Galerkin-type attention mechanism has linear complexity, our model avoids patch partitioning and offers global receptive fields, enabling precise estimation of large motions. Experimental results show that the proposed method surpasses state-of-the-art techniques in both fixed-size and continuous space-time video super-resolution tasks.
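To illustrate why a Galerkin-type attention scales linearly with the number of positions, the following is a minimal NumPy sketch of the generic softmax-free form known from the operator-learning literature, Q (norm(K)^T norm(V)) / n. It is not the STNO implementation: the abstract does not give the exact formulation, so the function names, the layer normalization on keys and values, the random projection weights, and the token layout (all positions of two frames flattened into one sequence) are assumptions made for illustration only.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def galerkin_attention(x, w_q, w_k, w_v):
    """Softmax-free attention computed as Q @ (norm(K)^T @ norm(V)) / n.

    x is an (n, d) array of tokens. Treating every spatial position of
    every frame as one token is an assumption of this sketch.
    """
    n = x.shape[0]
    q = x @ w_q
    k = layer_norm(x @ w_k)
    v = layer_norm(x @ w_v)
    # The (d, d) summary costs O(n * d^2) to build, so no (n, n)
    # attention map is ever formed and the cost is linear in n.
    context = k.T @ v / n
    return q @ context


rng = np.random.default_rng(0)
n, d = 2 * 16 * 16, 8  # hypothetical: 2 frames of 16x16 positions, 8 channels
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = galerkin_attention(x, w_q, w_k, w_v)
```

Because the keys and values are contracted first, every output token still depends on every input token (a global receptive field), yet memory and compute grow linearly with the number of positions. That property is what makes it plausible to process whole frames without partitioning them into patches.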
