PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution

Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, Xiangyang Ji

TL;DR

PatchVSR tackles the resolution ceiling of pre-trained video diffusion models by performing video super-resolution patch by patch. It introduces a dual-branch adapter: a patch-conditioned branch guides patch-level detail synthesis, and a global-context branch preserves global coherence. A training-free multi-patch joint modulation then fuses overlapping patches. Built on a fixed-resolution 512×512 base model, PatchVSR delivers high-quality 4K results. It uses LoRA fine-tuning to adapt the base model to patch distributions, and it reduces computation by confining self-attention to patches. Across synthetic, AI-generated, and real-world videos, PatchVSR is reported to achieve better perceptual fidelity and temporal consistency than state-of-the-art methods, with notable efficiency gains over full-frame diffusion approaches.

Abstract

Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, incurs unnecessarily intensive full-attention computation and restricts the output to a fixed resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not natively suited to patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity, while the global branch extracts context features from the resized full video to bridge the generation gap caused by the incomplete semantics of patches. In particular, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512×512 resolution base model, with extremely high efficiency.
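To make the efficiency argument concrete, the following back-of-the-envelope sketch compares the number of pairwise attention interactions for full-frame attention against attention confined to patches. It is an illustration under stated assumptions, not a measurement from the paper: it assumes 4K means 3840×2160, that the token count is proportional to pixel area, that self-attention cost grows with the square of the token count, and that the frame is covered by non-overlapping 512×512 patches. The function name `attention_cost_ratio` is hypothetical, and the estimate ignores overlapping auxiliary patches, the global-context branch, and all non-attention layers.

```python
import math


def attention_cost_ratio(frame_w: int, frame_h: int, patch: int):
    """Return (number of patches, full-frame cost / patch-wise cost).

    Assumption: tokens are proportional to pixels and self-attention
    cost is proportional to tokens squared. The frame count cancels
    out of the ratio, so only spatial sizes are needed.
    """
    # Non-overlapping patches needed to cover the frame (edges rounded up).
    num_patches = math.ceil(frame_w / patch) * math.ceil(frame_h / patch)

    full_tokens = frame_w * frame_h
    patch_tokens = patch * patch

    full_cost = full_tokens ** 2                   # one attention over everything
    patch_cost = num_patches * patch_tokens ** 2   # one attention per patch
    return num_patches, full_cost / patch_cost


num_patches, ratio = attention_cost_ratio(3840, 2160, 512)
print(f"{num_patches} patches; full-frame attention costs about {ratio:.1f}x more")
```

Under these assumptions the patch-wise scheme needs roughly 25 times fewer pairwise attention interactions at 4K, and the gap widens as the target resolution grows because the per-patch cost stays fixed.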

Paper Structure

This paper contains 44 sections, 2 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Flowchart of our PatchVSR. Building upon a pre-trained latent T2V model, we incorporate a patch condition branch and a global context branch. These branches extract features from partitioned video patches and the resized full video together with a binary mask that indicates the location of the ROI patch, respectively. Particularly, local patch features are added to the output of each block, while the global context feature is fused with the backbone feature through newly introduced cross-attention modules (G-CA). For simplicity, we have omitted other conditional inputs such as text prompts and time steps from this diagram. The processed patches are fused via a joint modulation scheme to produce a coherent super-resolution video.
  • Figure 2: Patch Partition Visualization. The input video is divided into non-overlapping segments, marked by the solid blue boxes. For joint modulation, auxiliary patches are created, indicated by the red dashed boxes, resulting in an overlapping ratio of $50\%$.
  • Figure 3: Qualitative comparisons on SynVideo30 (test). These videos are $\times 4$ super-resolution results. Zoom in for best view.
  • Figure 4: Qualitative comparisons on VideoGen30 (test). These videos are $\times 4$ super-resolution results. Zoom in for best view.
  • Figure 5: Visual evaluation of multi-patch joint modulation. Super-resolved videos at 2K resolution are used for comparison, each involving $3\times 3=9$ patches. The top, middle, and bottom rows show the input video, the result of stitching the patch latents directly, and our result, respectively.
  • ...and 8 more figures
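
The partition described in the Figure 2 caption and the location mask described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the auxiliary patches sit at half-patch offsets from the main grid (one simple way to obtain a 50% overlap), that the frame size is a multiple of the patch size, and that the mask is 1 inside the ROI patch and 0 elsewhere. The names `partition` and `roi_mask` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def partition(height: int, width: int, patch: int) -> Tuple[List[Box], List[Box]]:
    """Split a frame into main (non-overlapping) and auxiliary patches.

    Assumption: auxiliary patches are shifted by half a patch, so each
    one overlaps its neighbours on the main grid by 50%.
    """
    if patch % 2 or height % patch or width % patch:
        raise ValueError("sketch expects an even patch size that divides the frame")
    half = patch // 2
    main: List[Box] = []
    aux: List[Box] = []
    for top in range(0, height - patch + 1, half):
        for left in range(0, width - patch + 1, half):
            box = (top, left, top + patch, left + patch)
            # Boxes aligned to the patch grid form the non-overlapping tiling.
            if top % patch == 0 and left % patch == 0:
                main.append(box)
            else:
                aux.append(box)
    return main, aux


def roi_mask(height: int, width: int, box: Box) -> List[List[int]]:
    """Binary mask over the full frame: 1 inside the ROI patch, 0 outside."""
    top, left, bottom, right = box
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


main, aux = partition(1024, 1024, 512)
print(len(main), "main patches,", len(aux), "auxiliary patches")
mask = roi_mask(4, 4, (0, 0, 2, 2))  # tiny frame, ROI in the top-left quadrant
```

In practice the mask would be built at the resolution of the resized full video that feeds the global branch; the tiny 4×4 example here only shows the layout.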