SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

Qi Tang, Yao Zhao, Meiqin Liu, Chao Yao

TL;DR

SeeClear is a novel VSR framework built on conditional video generation and steered by instance-centric and channel-wise semantic controls. It integrates a Semantic Distiller and a Pixel Condenser, which work together to extract and upscale semantic details from low-resolution frames.

Abstract

Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it struggles to maintain detail consistency across frames because of stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear, a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which work together to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) uses video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency. Additionally, the Channel-wise Texture Aggregation Memory (CaTeGory) infuses extrinsic knowledge, capitalizing on long-standing semantic textures. Our method also enhances the blurring diffusion process with the ResShift mechanism, balancing sharpness against diffusion effects. Comprehensive experiments confirm our framework's advantage over state-of-the-art diffusion-based VSR techniques. The code is available at https://github.com/Tang1705/SeeClear-NeurIPS24.

Paper Structure (21 sections, 22 equations, 17 figures, 5 tables, 1 algorithm)

Figures (17)

  • Figure 1: Sketch of SeeClear. It consists of a Semantic Distiller and a Pixel Condenser, which are responsible for distilling instance-centric semantics from LR frames and for generating HR frames, respectively. The instance-centric and assembled channel-wise semantics act as a thermometer that controls the condition for generation.
  • Figure 2: Illustration of the Instance-Centric Alignment Module (InCAM). It uses segmentation features to bridge pixel-level information and instance-centric semantic tokens. The semantic-aware features can then be aligned in the semantic space according to their semantic relevance.
  • Figure 3: Illustration of the Channel-wise Texture Aggregation Memory (CaTeGory). It assembles textures by semantic class along the channel dimension.
  • Figure 4: Clip 011, REDS4
  • Figure 5: Clip 020, REDS4
  • ...and 12 more figures