Semantic Lens: Instance-Centric Semantic Alignment for Video Super-Resolution

Qi Tang; Yao Zhao; Meiqin Liu; Jian Jin; Chao Yao

TL;DR

Semantic Lens addresses pixel-level inter-frame misalignment in video super-resolution (VSR) by using semantic priors extracted from the degraded video itself. A Semantic Extractor decouples video content into global scene semantics and instance-centric tokens, and the Semantics-Powered Attention Cross-Embedding (SPACE) block bridges them to pixel-level features. SPACE comprises the Global Perspective Shifter (GPS) and the Instance-Specific Semantic Embedding Encoder (ISEE); a pre-alignment module, Implicit Masked Attention Guided Pre-Alignment (IMAGE), eases training. The result is an instance-centric semantic representation and a semantic bridge that improves inter-frame alignment, with state-of-the-art results reported on YTVIS-based benchmarks under challenging degradations, sharper textures, and interpretable attribution. The work shows the practical value of semantic priors for VSR, particularly in dynamic scenes and under occlusion, and points to broader opportunities for semantic-guided video restoration.

Abstract

As a key component of video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is challenging because motion in videos is intricately interwoven. In response to this issue, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, video is modeled as instances, events, and scenes via a Semantic Extractor. These semantics assist the Pixel Enhancer in understanding the recovered contents and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE), to bridge the pixel-level features with semantic knowledge. Concretely, the GPS module generates pairs of affine transformation parameters for pixel-level feature modulation conditioned on global semantics. After that, the ISEE module harnesses the attention mechanism to align the adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods.
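
To make the two SPACE steps in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the GPS step is a per-channel scale-and-shift predicted from a global semantic vector, and that the ISEE step can be approximated by cross-attention in which pixel features act as queries and instance tokens act as keys and values. The function names, the projection matrices `w_scale` and `w_shift`, the `1 + scale` form, the residual connection, and the absence of learned query/key/value projections are all illustrative assumptions.

```python
import numpy as np


def gps_modulate(feat, global_sem, w_scale, w_shift):
    """Affine modulation of pixel features conditioned on global semantics.

    feat:       (C, H, W) pixel-level features of one frame
    global_sem: (D,) global semantic vector of the same frame
    w_scale, w_shift: (C, D) hypothetical linear projections
    """
    scale = w_scale @ global_sem  # (C,) per-channel scale
    shift = w_shift @ global_sem  # (C,) per-channel shift
    # Assumption: "1 + scale" so that zero projections leave feat unchanged.
    return feat * (1.0 + scale[:, None, None]) + shift[:, None, None]


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def isee_cross_attention(feat, inst_tokens):
    """Embed instance semantics into pixel features via cross-attention.

    feat:        (C, H, W) pixel-level features of one frame
    inst_tokens: (N, C) instance-centric tokens, assumed to share dimension C
    Returns the fused features (C, H, W) and the attention map (H, W, N).
    """
    c, h, w = feat.shape
    queries = feat.reshape(c, h * w).T                     # (HW, C)
    logits = queries @ inst_tokens.T / np.sqrt(c)          # (HW, N)
    attn = softmax(logits, axis=-1)                        # rows sum to 1
    embedded = attn @ inst_tokens                          # (HW, C)
    fused = feat + embedded.T.reshape(c, h, w)             # residual (assumption)
    return fused, attn.reshape(h, w, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 4, 5))
    global_sem = rng.standard_normal(16)
    tokens = rng.standard_normal((3, 8))
    modulated = gps_modulate(
        feat,
        global_sem,
        0.1 * rng.standard_normal((8, 16)),
        0.1 * rng.standard_normal((8, 16)),
    )
    fused, attn = isee_cross_attention(modulated, tokens)
    print(fused.shape, attn.shape)
```

Under this reading, applying the same instance tokens to adjacent frames places their features in a shared instance-centric space, which is one plausible interpretation of the alignment the abstract describes. The attention map also indicates which instance each pixel attends to most strongly.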

Paper Structure (10 sections, 8 equations, 8 figures, 3 tables)

This paper contains 10 sections, 8 equations, 8 figures, 3 tables.

Introduction
Related Work
Method
Architecture
Semantic Extractor
Semantics-Powered Attention Cross-Embedding
Experiments
Performance Comparison
Ablation Study
Conclusion

Figures (8)

Figure 1: Top: Global semantics and instance-specific semantics. Middle: Original frames and event information (indicated by green arrows). Bottom: Patch PSNR heat map of five frames in a video, super-resolved by a single image super-resolution model. A clear boundary shows that PSNR is strongly related to video content.
Figure 2: The overall pipeline of Semantic Lens consists of a Semantic Extractor and a Pixel Enhancer. The Semantic Extractor decouples the low-resolution video into instances, events, and scenes, each characterized by its embodied semantics with a distinct descriptor. These semantics are used to enhance the pixel-level features of the Pixel Enhancer in a position-embedding-like manner, which yields semantic-aware features.
Figure 3: Illustration of Semantics-Powered Attention Cross-Embedding (SPACE) Block, composed of Global Perspective Shifter (GPS) and Instance-Specific Semantic Embedding Encoder (ISEE). It is inserted before MFSAB, the basic unit of feature propagation in Pixel Enhancer, to bridge the semantic-level priors with pixel-level features.
Figure 4: Illustration of Implicit Masked Attention Guided Pre-Alignment (IMAGE) Module. Attention is conducted within a local window for pre-alignment, which is modulated by the instance masks.
Figure 5: Visual comparison of VSR ($4 \times$) on YTVIS-19 dataset.
...and 3 more figures
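
The patch PSNR heat map mentioned in the Figure 1 caption can be approximated with a short computation. The sketch below is an illustration only: the patch size, the peak value of 1.0, the single-channel input, and the function name `patch_psnr_map` are assumptions, since the page does not state how the authors computed their map.

```python
import numpy as np


def patch_psnr_map(sr, hr, patch=8, peak=1.0):
    """Per-patch PSNR in dB between a super-resolved and a ground-truth frame.

    sr, hr: (H, W) float arrays with values in [0, peak]
    patch:  side length of the non-overlapping square patches (assumed value)
    Rows and columns that do not fill a whole patch are cropped.
    """
    h = sr.shape[0] // patch * patch
    w = sr.shape[1] // patch * patch
    sq_err = (sr[:h, :w] - hr[:h, :w]) ** 2
    # Split into (rows of patches, patch, columns of patches, patch) and average.
    mse = sq_err.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    mse = np.maximum(mse, 1e-12)  # avoid log of zero for identical patches
    return 10.0 * np.log10(peak ** 2 / mse)


if __name__ == "__main__":
    hr = np.zeros((16, 16))
    sr = hr.copy()
    sr[:8, :8] += 0.1  # degrade only the top-left patch
    print(patch_psnr_map(sr, hr))
```

Plotting the returned array as an image gives a heat map in which regions with different content show different PSNR levels.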
