V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Shenghe Zheng, Junpeng Jiang, Wenbo Li

Abstract

Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of the data used by existing restoration methods), pretrained video models can be induced to perform competitive image restoration, handling multiple tasks with a single model and rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with extremely limited data, challenging the traditional boundary between generative modeling and low-level vision and opening a new design paradigm for foundation models in visual tasks.

Paper Structure

This paper contains 23 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Left: Image restoration is formulated as progressive video generation with frame drift correction. Right: Leveraging video generative priors leads to stronger generalization under limited data than the current image restoration method FoundIR.
  • Figure 2: Overview of the proposed pipeline. The upper part shows data construction and training, where paired low- and high-quality images are used to build pseudo-temporal sequences for progressive restoration learning. A progressive resolution training strategy is adopted to improve fine-grained detail modeling, and an auxiliary generative model is trained for final-frame correction. The lower part shows inference, where the model generates a restoration trajectory and uses the refined last frame as the final output.
  • Figure 3: Visualization results on a subset of the FoundIR test set. FoundIR-G is the generalist model of FoundIR. GT denotes ground truth. Bounding boxes with different colors indicate zoomed regions for detailed comparison. Compared with other methods, our approach achieves higher visual fidelity and stronger structural consistency, while showing superior robustness across diverse degradation patterns.
  • Figure 4: Performance improvement on the test dataset of FoundIR brought by few-shot training.
  • Figure 5: (a) Comparison on the FoundIR test set with and without the refine model. GT denotes ground truth. Correction improves visual quality and enhances fine details. (b) Ablation study results. Top: effect of different training frame numbers for image restoration. Bottom: performance improves as training data scale increases.
  • ...and 5 more figures
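
The Figure 2 caption describes building pseudo-temporal sequences from paired low- and high-quality images so that restoration can be learned as a progressive, video-like process. The captions do not say how the intermediate frames are produced, so the following is only a minimal sketch under an assumption: frames are formed by linear blending between the degraded and the clean image. The function name `build_pseudo_temporal_sequence` and the parameter `num_frames` are hypothetical and are not taken from the paper.

```python
import numpy as np


def build_pseudo_temporal_sequence(lq, hq, num_frames=5):
    """Return a (num_frames, ...) stack that moves from `lq` to `hq`.

    ASSUMPTION: intermediate frames are linear blends of the two images.
    The paper may construct its training sequences differently.
    """
    if lq.shape != hq.shape:
        raise ValueError("lq and hq must have the same shape")
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")

    lq = lq.astype(np.float32)
    hq = hq.astype(np.float32)

    # Blend weights run from 0 (pure degraded input) to 1 (pure clean target).
    alphas = np.linspace(0.0, 1.0, num_frames, dtype=np.float32)
    # Reshape to (num_frames, 1, 1, ...) so the weights broadcast over pixels.
    alphas = alphas.reshape(-1, *([1] * lq.ndim))

    return (1.0 - alphas) * lq[None] + alphas * hq[None]


# Toy usage: an all-black "degraded" image and an all-white "clean" image.
lq_image = np.zeros((4, 4, 3), dtype=np.float32)
hq_image = np.ones((4, 4, 3), dtype=np.float32)
sequence = build_pseudo_temporal_sequence(lq_image, hq_image, num_frames=5)
```

In this sketch the first frame equals the degraded input and the last frame equals the clean target, mirroring the restoration trajectory described in the caption, whose final frame is used as the output at inference time.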