MambaOVSR: Multiscale Fusion with Global Motion Modeling for Chinese Opera Video Super-Resolution

Hua Chang, Xin Xu, Wei Liu, Wei Wang, Xin Yuan, Kui Jiang

TL;DR

This work addresses the restoration of archival Chinese opera videos, which suffer from low frame rates, low resolution, and large inter-frame motions. It introduces the COVC dataset and MambaOVSR, a Mamba-based multiscale fusion network that combines a Global Fusion Module, a Multiscale Synergistic Mamba Module, and a MambaVR block to model global motion and align sequences of different lengths. MambaOVSR achieves state-of-the-art performance on COVC and Vimeo90K; on COVC it outperforms the previous best STVSR method by an average of 1.86 dB in PSNR and produces clearer, more detailed reconstructions of opera footage. The planned dataset and code release, together with the improved handling of large motions, supports archival restoration and high-fidelity opera video synthesis.

Abstract

Chinese opera is celebrated for preserving classical art. However, the limitations of early filming equipment have degraded videos of last-century performances by renowned artists (e.g., low frame rates and low resolution), hindering archival efforts. Although space-time video super-resolution (STVSR) has advanced significantly, applying it directly to opera videos remains challenging. The scarcity of datasets impedes the recovery of high-frequency details, and existing STVSR methods lack global modeling capabilities, which compromises visual quality when handling opera's characteristic large motions. To address these challenges, we introduce a large-scale Chinese Opera Video Clip (COVC) dataset and propose a Mamba-based multiscale fusion network for space-time Opera Video Super-Resolution (MambaOVSR). Specifically, MambaOVSR involves three novel components: the Global Fusion Module (GFM), which models motion through a multiscale alternating scanning mechanism; the Multiscale Synergistic Mamba Module (MSMM), which aligns sequences of different lengths; and the MambaVR block, which resolves feature artifacts and positional information loss during alignment. Experimental results on the COVC dataset show that MambaOVSR outperforms the SOTA STVSR method by an average of 1.86 dB in terms of PSNR. The dataset and code will be publicly released.
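To give a sense of what a 1.86 dB PSNR gain means, the following is a minimal sketch of the standard PSNR definition. It is not the authors' evaluation code. The function names, the flat-list frame representation, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities (an illustrative
    simplification; real evaluations operate on image arrays).
    """
    if len(reference) != len(restored):
        raise ValueError("frames must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed peak value."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    print(round(psnr([100] * 4, [110] * 4), 2))
    print(round(mse_ratio_for_gain(1.86), 3))
```

For a single frame pair, a 1.86 dB gain corresponds to a mean squared error that is roughly 1.53 times lower. Because the reported figure is an average over test sets, this conversion is only indicative.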

Paper Structure

This paper contains 13 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) Visual comparison and high-frequency content ratios for the same model trained on Vimeo90K (‑V) and COVC (‑O); other methods’ ratios are reported in Appendix Section A. The COVC‑trained model recovers more high-frequency details. (b) Existing methods synthesize intermediate frames with blurring artifacts.
  • Figure 2: Comparison of COVC and Vimeo samples and statistical data of COVC. Please zoom in for the best view.
  • Figure 3: Architecture of the proposed Mamba-based multiscale fusion network. First, features are extracted, and the missing intermediate frame features are obtained by the Global Fusion Module (GFM) with a multiscale alternating scanning mechanism (MASM). Next, each frame feature is enhanced by aligning sequences of different lengths using the Multiscale Synergistic Mamba Module (MSMM). Finally, the high-quality video is obtained by feature reconstruction and PixelShuffle.
  • Figure 4: Quantitative comparison with other space-time video super-resolution (STVSR) methods on COVC. (a) Radar plot of PSNR for all generated frames (AVG) and for interpolated frames (VFI) on the three test sets (High, Medium, and Low). (b) The corresponding radar plot for SSIM. All metrics have been normalized; detailed results can be found in Table 3 of Appendix Section C.1.
  • Figure 5: Qualitative comparison of different approaches on Chinese opera videos of three quality levels; from top to bottom: the High, Medium, and Low test sets. Our framework recovers more details while producing fewer artifacts.
  • ...and 3 more figures
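
The Figure 3 caption names PixelShuffle as the final upsampling step. As a rough illustration of that operation only, and not of the authors' reconstruction module, the sketch below rearranges a feature map of shape (C·r², H, W) into (C, H·r, W·r) using the same channel ordering as PyTorch's `torch.nn.PixelShuffle`. The nested-list representation, the function name `pixel_shuffle`, and the upscale factor `r = 2` in the example are assumptions; this section does not state the factor the paper uses.

```python
def pixel_shuffle(feature, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Each group of r*r input channels supplies the r x r block of output
    pixels for one output channel.
    """
    channels_in = len(feature)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = channels_in // (r * r)
    h = len(feature[0])
    w = len(feature[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h):
            for x in range(w):
                for i in range(r):      # row offset inside the r x r block
                    for j in range(r):  # column offset inside the r x r block
                        src = feature[c * r * r + i * r + j][y][x]
                        out[c][y * r + i][x * r + j] = src
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel (hypothetical r = 2).
    low_res = [[[1]], [[2]], [[3]], [[4]]]
    print(pixel_shuffle(low_res, 2))
```

The operation has no learned parameters; it only moves values from the channel dimension into the spatial dimensions, so the preceding reconstruction layers decide what the upsampled pixels contain.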