UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji

TL;DR

UniMMVSR addresses high-resolution, multi-modal video generation by decoupling content synthesis from detail refinement in a cascaded latent-diffusion framework. It introduces a unified conditioning scheme that injects text, multiple ID images, and reference videos through channel and token concatenation, with separated RoPE to preserve correlations across modalities. An SDEdit-based degradation pipeline and a difficult-to-easy training curriculum improve robustness to base-model artifacts and enable cross-task data transfer, which lets generation scale to 4K while staying faithful to the multi-modal guidance. Empirical results show state-of-the-art perceptual quality and reference fidelity across text-to-video, multi-ID image-guided, and editing tasks, supporting the approach and its potential for practical, controllable ultra-high-resolution video synthesis.
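
To make the difference between the two injection routes concrete, the following minimal sketch tracks only tensor shapes. It assumes, without confirmation from this summary, that the spatially aligned low-resolution latent is the input joined by channel concatenation and that ID images or reference videos are the inputs joined by token concatenation. The function names, the embedding width, and all sizes are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int]  # (number of tokens, channels per token)


def channel_concat(noisy: Shape, low_res: Shape) -> Shape:
    """Stack two spatially aligned inputs along the channel axis.

    Token counts must match because each token of one input is paired
    with the token at the same position in the other.
    """
    if noisy[0] != low_res[0]:
        raise ValueError("channel concatenation needs matching token counts")
    return (noisy[0], noisy[1] + low_res[1])


def token_concat(target: Shape, references: List[Shape]) -> Shape:
    """Append reference tokens to the target sequence.

    Channel widths must match, but the references may have any length,
    so they do not need to be spatially aligned with the target.
    """
    for ref in references:
        if ref[1] != target[1]:
            raise ValueError("token concatenation needs matching channel widths")
    return (target[0] + sum(ref[0] for ref in references), target[1])


# Hypothetical sizes, chosen only for illustration.
noisy_latent = (1024, 16)
low_res_latent = (1024, 16)
fused = channel_concat(noisy_latent, low_res_latent)

# Assumed embedding step: project every token to a model width of 128.
embedded = (fused[0], 128)
id_images = [(64, 128), (64, 128)]
sequence = token_concat(embedded, id_images)

print(fused)     # (1024, 32)
print(sequence)  # (1152, 128)
```

Channel concatenation leaves the sequence length unchanged and widens each token, while token concatenation lengthens the sequence that attention operates over. This is why the latter can accommodate references that share no pixel-level alignment with the target video.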

Abstract

Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.

Paper Structure

This paper contains 45 sections, 3 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: UniMMVSR is a unified framework that supports video super-resolution with multi-modal input conditions. By cooperating with the low-resolution multi-modal generative model, the proposed cascaded framework can effectively extend the controllable video generation to ultra-high-resolution (e.g., 4K) with high visual quality and subject consistency.
  • Figure 2: Overview of UniMMVSR in the context of a cascaded generation framework. Upsampler denotes the sequential operations of VAE decoding, upscaling via bilinear interpolation, and VAE encoding. TC and CC denote token concatenation and channel concatenation, respectively. Texts are encoded by a text encoder and then injected via cross-attention layers, which are omitted for simplicity.
  • Figure 3: Qualitative comparisons on text-to-video generation, text-guided video editing, and multi-ID image-guided text-to-video generation tasks, from top to bottom. (Zoom in for best view.)
  • Figure 4: Visual comparisons of single-task and unified models. (Zoom in for best view.)
  • Figure 5: Qualitative results of 4K multi-ID image-guided text-to-video generation.
  • ...and 14 more figures
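
The Upsampler described in the Figure 2 caption (VAE decoding, bilinear upscaling, VAE encoding) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: `decode` and `encode` are stand-in callables for a VAE, frames are plain 2D lists of floats, and the corner-aligned sampling convention is a choice made here, not one stated in the source.

```python
from typing import Callable, List

Frame = List[List[float]]  # one single-channel frame as rows of pixel values


def bilinear_resize(frame: Frame, out_h: int, out_w: int) -> Frame:
    """Resize a frame with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = len(frame), len(frame[0])
    out: Frame = []
    for i in range(out_h):
        # Map the output row index back to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row: List[float] = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend horizontally on the two neighbouring rows, then vertically.
            top = frame[y0][x0] * (1 - wx) + frame[y0][x1] * wx
            bottom = frame[y1][x0] * (1 - wx) + frame[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def upsampler(
    latent: object,
    decode: Callable[[object], List[Frame]],
    encode: Callable[[List[Frame]], object],
    out_h: int,
    out_w: int,
) -> object:
    """Decode a latent to pixels, upscale every frame, and re-encode."""
    frames = decode(latent)
    upscaled = [bilinear_resize(f, out_h, out_w) for f in frames]
    return encode(upscaled)


# Identity functions stand in for the VAE, so the "latent" is the video itself.
video = [[[0.0, 1.0], [2.0, 3.0]]]
result = upsampler(video, lambda z: z, lambda v: v, 3, 3)
print(result[0])  # [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0]]
```

Because the upscaling happens in pixel space between the decode and encode steps, the super-resolution model receives a latent that already has the target spatial size.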