Progressive Image Restoration via Text-Conditioned Video Generation

Peng Kang, Xijun Wang, Yu Yuan

TL;DR

This work reframes image restoration as progressive video generation conditioned on text, and demonstrates that CogVideo can learn restoration trajectories for super-resolution, deblurring, and low-light enhancement. By fine-tuning with LoRA on three progression datasets and comparing uniform versus scene-adaptive prompts, the approach achieves improved perceptual metrics and temporal coherence. The model generalizes to real-world motion blur in ReLoBlur without extra training, underscoring robustness and transferability. Overall, the paper introduces a unified, interpretable paradigm that leverages temporal diffusion priors to perform cross-task restoration within a single generative framework.

Abstract

Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme in which prompts are generated by the LLaVA multimodal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences whose fidelity and perceptual metrics (PSNR, SSIM, and LPIPS) improve across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
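
To make the idea of a "gradual transition from degraded to clean frames" concrete, the following is a minimal sketch of how one such training sequence might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how intermediate frames are produced, so the linear blend, the `darken` degradation, its `gain` value, and the choice of 9 frames are all hypothetical.

```python
import numpy as np

def darken(image, gain=0.2):
    """Hypothetical low-light degradation: scale intensities toward black."""
    return np.clip(image * gain, 0.0, 1.0)

def progression_frames(degraded, clean, num_frames=9):
    """Blend linearly from the degraded image to the clean image.

    Assumption: a linear schedule; the real datasets may use a different
    degradation schedule per task.
    """
    weights = np.linspace(0.0, 1.0, num_frames)
    return np.stack([(1.0 - w) * degraded + w * clean for w in weights])

# Stand-in for a clean image with values in [0, 1].
rng = np.random.default_rng(0)
clean = rng.random((32, 32, 3))

frames = progression_frames(darken(clean), clean)

# Mean absolute error to the clean image shrinks frame by frame.
errors = [float(np.mean(np.abs(f - clean))) for f in frames]
```

The first frame equals the degraded input and the last frame equals the clean target, so a video model trained on such sequences sees restoration as motion through time. Super-resolution or deblurring sequences could be sketched the same way by swapping `darken` for a downsampling or blur function.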

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of the CogVideo fine-tuning process for image restoration. The model receives a video showing progressive visual enhancement (e.g., from low-resolution to super-resolution) and a corresponding textual prompt describing the scene and restoration dynamics. Using LoRA fine-tuning, CogVideo learns to align temporal visual improvements with text semantics, enabling it to generate restoration sequences conditioned on both the input frame and the prompt.
  • Figure 2: Frame-wise restoration performance across three enhancement tasks: (a) super-resolution, (b) deblurring, and (c) low-light enhancement. For each task, PSNR and SSIM generally increase while LPIPS decreases as frames progress, confirming that the model learns temporal enhancement dynamics. The various-prompt version consistently provides slightly higher perceptual quality and smoother progression, highlighting the benefit of scene-aware textual conditioning.
  • Figure 3: Qualitative restoration results across three enhancement tasks using the various-prompt fine-tuned CogVideo. Each row visualizes frames 1, 3, 5, 7, and 9 to illustrate the temporal restoration trajectory. The model demonstrates smooth progression, gradually recovering fine details, edges, and illumination while preserving global structure and perceptual consistency.
  • Figure 4: Qualitative example from the ReLoBlur dataset focusing on a moving football scene. The fine-tuned CogVideo progressively restores the sharp structure and texture of the football while reducing surrounding motion streaks and background smear. From frame 1 to frame 9, the football becomes increasingly clear and well-defined, demonstrating the model’s ability to recover localized high-frequency motion details and maintain temporal consistency without over-sharpening artifacts.
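
The frame-wise evaluation described for Figure 2 can be sketched as follows. This is a minimal example that assumes each generated frame is compared against a single clean reference image; only PSNR is computed here, since SSIM and LPIPS would need additional packages. The synthetic sequence (a reference plus noise that shrinks over time) and the names `psnr` and `framewise_psnr` are hypothetical stand-ins for real model outputs and the paper's evaluation code.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def framewise_psnr(frames, reference):
    """Score every frame of a sequence against the same clean reference."""
    return [psnr(reference, frame) for frame in frames]

# Synthetic stand-in for a restoration sequence: noise fades over 9 frames.
rng = np.random.default_rng(0)
reference = rng.random((32, 32, 3))
noise = rng.standard_normal(reference.shape)
scales = np.linspace(0.5, 0.05, 9)
frames = [reference + s * noise for s in scales]

scores = framewise_psnr(frames, reference)
```

A rising `scores` list corresponds to the upward PSNR trend the caption describes; plotting it against the frame index would give one curve of the kind shown in Figure 2.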