TextOVSR: Text-Guided Real-World Opera Video Super-Resolution

Hua Chang; Xin Xu; Wei Liu; Jiayi Wu; Kui Jiang; Fei Ma; Qi Tian

Abstract

Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Experiments on our OperaLQ benchmark show that TextOVSR outperforms state-of-the-art methods both qualitatively and quantitatively. The code is available at https://github.com/ChangHua0/TextOVSR.

Paper Structure (22 sections, 4 equations, 10 figures, 5 tables)

Introduction
Related Work
Video Super-Resolution
Real-World Video Super-Resolution
Text-guided Video Super-Resolution
Method
Description Text Generation
TextOVSR
Degradation-Robust Feature Fusion Module
Text-Enhanced Discriminator
Objective Functions
Experiments
Datasets and Metrics
Implementation Details
Comparison with State-of-the-Arts
...and 7 more sections

Figures (10)

Figure 1: Frameworks for real-world video super-resolution. (a) The classical synthetic degradation pipeline (D) simulates real degradations for the VSR model. (b) Real degradation modeling extracts authentic noise from external datasets and applies a negative constraint ($\mathcal{L}_{neg}$) to enhance robustness. (c) Our proposed TextOVSR introduces text-guided priors to enrich image features and model diverse degradations in the feature space. $\lambda$ controls the mixing ratio.
Figure 2: Generation process of degradation- and content-descriptive texts. Degradation-descriptive text is generated according to different intensity levels in the high-order degradation pipeline, while content-descriptive text is produced directly from high-resolution inputs (HRs) using a multimodal large language model (MLLM), rather than from degraded low-resolution videos (LRs).
Figure 3: The proposed TextOVSR network and TED adopt a two-stage training scheme. In the first stage, only TextOVSR is trained. The positive branch (blue) takes content-descriptive text ($T_C$) and degraded videos ($V_{lr}$) as input, while the negative branch (red) takes degradation-descriptive text ($T_D$) and mixed-noise videos ($\widetilde{V}_{lr}$). Text features are extracted using a frozen CLIP encoder and fused with image features through the proposed DRF module. In the second stage, TextOVSR serves as the generator and TED as the discriminator. Adversarial training refines texture realism by selecting reliable textual features and integrating them with reconstructed image features. Here, $t-1$, $t$, and $t+1$ denote three consecutive frames, with detailed propagation described in the TextOVSR section.
Figure 4: The proposed Degradation-Robust Feature Fusion (DRF) module. $F_{I}^{t+1}$ and $F_{T}^{t+1}$ denote the image and text feature vectors, respectively, and $M^{t}$ represents the fused feature.
Figure 5: OperaLQ Dataset. Our OperaLQ dataset consists of real degraded opera videos with varying content and resolutions.
...and 5 more figures
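
The Figure 2 caption states that degradation-descriptive text is generated from the intensity levels of the degradation pipeline. The following is a minimal sketch of that idea, not the authors' implementation: the level names, the cut points in `THRESHOLDS`, the parameter names (`blur`, `noise`, `compression`), and the sentence template are all hypothetical choices made for illustration.

```python
from bisect import bisect_right

# Hypothetical level names; the paper's actual vocabulary is not given here.
LEVELS = ("slight", "moderate", "severe")

# Hypothetical cut points per degradation type (assumed units in comments).
# A value below the first cut point is "slight"; a value at or above the
# second cut point is "severe".
THRESHOLDS = {
    "blur": (1.0, 2.5),           # assumed: Gaussian blur sigma in pixels
    "noise": (5.0, 15.0),         # assumed: noise sigma on a 0-255 scale
    "compression": (30.0, 60.0),  # assumed: 100 minus JPEG quality
}


def intensity_level(kind, value):
    """Bucket one degradation parameter into a named intensity level."""
    return LEVELS[bisect_right(THRESHOLDS[kind], value)]


def degradation_text(params):
    """Compose a degradation-descriptive sentence from sampled parameters."""
    parts = [f"{intensity_level(kind, value)} {kind}"
             for kind, value in params.items()]
    return "A low-quality video with " + ", ".join(parts) + "."


print(degradation_text({"blur": 0.8, "noise": 12.0, "compression": 70.0}))
```

Because the text is derived from the parameters used to synthesize the degradation, it is available for every training sample without annotation. Content-descriptive text, by contrast, is produced by an MLLM from the high-resolution input according to the same caption, and is not covered by this sketch.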
