ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu

Abstract

Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
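
The Relay LoRA hand-off described in the abstract can be illustrated on a single weight matrix. This is a minimal sketch, assuming the standard LoRA parameterization (a low-rank update `B @ A` added to a frozen weight). The sizes, rank, and variable names are hypothetical, and random matrices stand in for adapters that would actually be trained; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2  # hypothetical layer size and LoRA rank

# Frozen weight of one layer of the pre-trained video model (random stand-in).
W_base = rng.normal(size=(d_out, d_in))


def lora_delta(B, A):
    """Low-rank weight update in the standard LoRA form."""
    return B @ A


# Stage 1: LoRA_1 adapts the model to the image domain (low-resolution images).
B1 = rng.normal(size=(d_out, rank))
A1 = rng.normal(size=(rank, d_in))

# Stage 2: LoRA_1 is merged and frozen, and LoRA_2 is trained on top of it
# (high-resolution images).
W_merged = W_base + lora_delta(B1, A1)
B2 = rng.normal(size=(d_out, rank))
A2 = rng.normal(size=(rank, d_in))
W_stage2 = W_merged + lora_delta(B2, A2)

# Inference: only LoRA_2 is applied to the original base weights, so the
# image-domain shift learned by LoRA_1 is left out.
W_infer = W_base + lora_delta(B2, A2)
```

In this sketch the stage-2 training weights and the inference weights differ by exactly the LoRA_1 update, which is the part the abstract says is discarded to preserve the video generation modality.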

Paper Structure (19 sections, 12 equations, 14 figures, 4 tables)


Figures (14)

  • Figure 1: Ultra-high-resolution results generated by our method built upon Wan2.2 [wan2025wanopenadvancedlargescale]. The resolution is marked in the top-right corner of each result in the format width$\times$height. Corresponding prompts can be found in the appendix.
  • Figure 2: The two plots show how training VRAM consumption and iteration time scale with the number of frames under different resolutions.
  • Figure 3: The plot on the left shows the general pipeline of our Relay LoRA. The two panels on the right provide a qualitative analysis on Wan2.2 [wan2025wanopenadvancedlargescale]: naive high-resolution image fine-tuning (Naive LoRA) introduces noticeable artifacts, whereas our method (Relay LoRA) effectively removes them.
  • Figure 4: Upper Left: Stage-1 fine-tuning trains $\mathrm{LoRA}_1$ to adapt the DiT backbone to single low-resolution frame generation. Upper Middle: Stage-2 first merges $\mathrm{LoRA}_1$ into the base model and freezes the merged weights, then trains a Relay $\mathrm{LoRA}_2$ to adapt the DiT backbone to single high-resolution frame generation. Upper Right: During inference, only $\mathrm{LoRA}_2$ is loaded on the base model for video generation. Bottom Left: Global-Coarse-Local-Fine-Attention combines an inward sliding-window local attention with pooled coarse attention. Bottom Right: The training objective first degrades the latents via a downsample--upsample operation, then adds noise to the degraded sample. The model applies the predicted flow and computes the loss against the clean latents.
  • Figure 5: Qualitative comparison. ViBe yields high-resolution videos characterized by high-fidelity details and coherent structure. The red boxes highlight regions with incorrect semantics or layout. The blue boxes provide zoomed-in views.
  • ...and 9 more figures
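
The training objective summarized in the Figure 4 caption (degrade the latents, add noise, apply the predicted flow, compare against the clean latents) can be sketched as follows. This is a rough illustration under several assumptions that the text above does not confirm: average pooling followed by nearest-neighbor upsampling as the degradation, a rectified-flow interpolation `x_t = (1 - t) * x + t * noise` with velocity defined as noise minus data, and a mean-squared-error loss. `degrade`, `hf_aware_loss`, and `predict_flow` are hypothetical names.

```python
import numpy as np


def degrade(latent, factor=2):
    """Average-pool a (C, H, W) latent by `factor`, then upsample it back
    with nearest-neighbor repetition. H and W must be divisible by `factor`."""
    c, h, w = latent.shape
    pooled = latent.reshape(c, h // factor, factor, w // factor, factor)
    pooled = pooled.mean(axis=(2, 4))
    return pooled.repeat(factor, axis=1).repeat(factor, axis=2)


def hf_aware_loss(predict_flow, clean, t, rng, factor=2):
    """Reconstruction loss against the clean latent, starting from a noised,
    degraded latent. `predict_flow(x_t, t)` stands in for the model."""
    degraded = degrade(clean, factor)
    noise = rng.normal(size=clean.shape)
    x_t = (1.0 - t) * degraded + t * noise   # assumed interpolation
    v = predict_flow(x_t, t)                 # assumed velocity: noise - data
    x_hat = x_t - t * v                      # apply the predicted flow
    return float(np.mean((x_hat - clean) ** 2))


rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8, 8))

# An oracle that points exactly back to the clean latent drives the loss to
# (numerically) zero, so the target includes the high-frequency content that
# the degradation removed.
oracle = lambda x_t, t: (x_t - clean) / t
oracle_loss = hf_aware_loss(oracle, clean, t=0.5, rng=rng)
```

Because the loss is measured against the clean latent rather than the degraded one, a model that only undoes the noise would still be penalized for the missing detail under these assumptions.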