Understanding Degradation with Vision Language Model

Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

TL;DR

This work reframes image degradation understanding as a hierarchical, parametric task and introduces DU-VLM, a Vision-Language Model that predicts a structured degradation description $(t,k,v)$ (degradation type, parameter key, continuous value) through autoregressive next-token prediction. It unifies taxonomy, keys, and continuous values under a single objective, supported by theoretical error bounds and a new large-scale benchmark, DU-110k. The approach combines multimodal chain-of-thought, frequency and edge cues, and a diffusion-based restoration prior with offline and online reinforcement learning to enforce physical consistency and generalization. Empirical results show stronger degradation understanding than generalist baselines and zero-shot diffusion-based restoration across diverse degradations, with robust generalization to real-world data. This work bridges semantic reasoning and physical image formation models, enabling interpretable and controllable restoration in practical applications.

Abstract

Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
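
To make the idea of predicting $(t,k,v)$ as tokens concrete, the following is a minimal sketch of how a continuous value might be snapped to a uniform grid and serialized next to its type and key. It is not the paper's tokenizer: the `ValueGrid` helper, the `<type:…>`/`<key:…>`/`<val:…>` token format, and the example names `blur` and `sigma` are assumptions made for illustration. The point it shows is that rounding to a grid of spacing $\Delta$ introduces an error of at most $\Delta/2$ for in-range values, which is the kind of quantization-limited error the abstract refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueGrid:
    """Uniform quantization grid on [lo, hi] with spacing delta (hypothetical helper)."""
    lo: float
    hi: float
    delta: float

    @property
    def num_bins(self) -> int:
        # Number of grid points, endpoints included.
        return int(round((self.hi - self.lo) / self.delta)) + 1

    def encode(self, v: float) -> int:
        # Clamp to the grid range, then round to the nearest grid index.
        v = min(max(v, self.lo), self.hi)
        return int(round((v - self.lo) / self.delta))

    def decode(self, z: int) -> float:
        # Map a grid index back to its representative value.
        return self.lo + z * self.delta


def serialize(t: str, k: str, v: float, grid: ValueGrid) -> list[str]:
    """Turn a (type, key, value) tuple into three tokens (assumed format)."""
    return [f"<type:{t}>", f"<key:{k}>", f"<val:{grid.encode(v)}>"]


def deserialize(tokens: list[str], grid: ValueGrid) -> tuple[str, str, float]:
    """Invert serialize(); the value is recovered only up to the grid spacing."""
    t = tokens[0][len("<type:"):-1]
    k = tokens[1][len("<key:"):-1]
    z = int(tokens[2][len("<val:"):-1])
    return t, k, grid.decode(z)


if __name__ == "__main__":
    grid = ValueGrid(lo=0.0, hi=5.0, delta=0.05)
    tokens = serialize("blur", "sigma", 1.23, grid)
    t, k, v_hat = deserialize(tokens, grid)
    # The round-trip error never exceeds half the grid spacing.
    assert abs(v_hat - 1.23) <= grid.delta / 2 + 1e-12
    print(tokens, v_hat)
```

A finer grid reduces the round-trip error but enlarges the value vocabulary (`num_bins`), which is the trade-off any such discretization has to balance.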

Paper Structure

This paper contains 37 sections, 2 theorems, 26 equations, 11 figures, and 9 tables.

Key Result

Proposition 4.1

Consider the NTP loss $\mathcal{L}_{\text{NTP}} = -\log p_\theta(t, k, z \mid \bm{x})$. Assume (i) the conditional density of the continuous value $p(v \mid t,k,\bm{x})$ follows a local Gaussian distribution $\mathcal{N}(\mu_\theta, \sigma^2\mathbf{I})$, and (ii) the quantization grid $\Delta$ is sufficiently small. Then [...] where $C = -\log \Delta + \text{const}$ is independent of $\theta$.
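
The statement above breaks off before its displayed equation. As a hedged sketch only, here is how an equivalence of this kind usually follows from assumptions (i) and (ii). It takes a scalar value $v$, lets $z$ denote the grid cell of width $\Delta$ containing $v$, and treats $\sigma$ as fixed; these choices are assumptions, and the paper's exact statement may differ.

```latex
% Chain-rule factorization of the structured prediction:
\mathcal{L}_{\text{NTP}}
  = -\log p_\theta(t \mid \bm{x})
    -\log p_\theta(k \mid t, \bm{x})
    -\log p_\theta(z \mid t, k, \bm{x}).

% For a sufficiently small grid, the bin mass is approximately density times width:
p_\theta(z \mid t, k, \bm{x})
  = \int_{\text{bin}(z)} \mathcal{N}(v'; \mu_\theta, \sigma^2)\, dv'
  \approx \Delta \cdot \mathcal{N}(v; \mu_\theta, \sigma^2).

% Substituting the Gaussian density gives a squared-error term plus a constant:
-\log p_\theta(z \mid t, k, \bm{x})
  \approx \frac{(v - \mu_\theta)^2}{2\sigma^2} + C,
\qquad
C = -\log \Delta + \tfrac{1}{2}\log(2\pi\sigma^2).
```

Under these assumptions, minimizing the NTP loss amounts to minimizing two cross-entropy terms (type and key) plus a scaled squared error on the value, with $C$ independent of $\theta$, which agrees with the constant $C = -\log \Delta + \text{const}$ quoted in the proposition.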

Figures (11)

  • Figure 1: Comparison of degradation understanding paradigms. (Left) Latent embedding approaches. (Middle) Free-form text description methods. (Right) Our DU-VLM, which explicitly predicts a hierarchical tuple, providing physically interpretable parameters to directly guide restoration.
  • Figure 2: The construction pipeline of the DU-110k benchmark. We employ a hybrid Simulation-Verification strategy. (Top) Physics-based models synthesize initial clean-degraded pairs with human verification to ensure realism. (Bottom) Examples of the degradation categories alongside specific physical parameters.
  • Figure 3: Overview of the DU-VLM framework. The top inference phase leverages multimodal inputs and Chain-of-Thought reasoning to predict hierarchical parameters for guiding restoration. The bottom training pipeline progresses from Supervised Fine-Tuning to Offline Structured RL to enforce physical consistency, followed by Online Self-supervised RL for open-world adaptation.
  • Figure 4: Quantitative comparison using Radar Charts on three metrics: (a) Top-1 Accuracy, (b) F1-score, and (c) Joint Type-Key Accuracy. The visualization covers four conditions (Night, Haze, Blur, Low Resolution) and the Average performance. Our method (highlighted in red) demonstrates robust performance across all scenarios.
  • Figure 5: Qualitative comparison of image restoration results. Our approach (rightmost column) produces cleaner and more visually pleasing results compared to baselines, effectively removing complex degradations while preserving semantic details.
  • ...and 6 more figures
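
The Figure 3 caption mentions structured RL that enforces physical consistency, but this summary does not give the reward itself. Purely as an illustration of what a hierarchy-aware reward over $(t,k,v)$ could look like, the sketch below grants credit level by level: nothing if the type is wrong, partial credit if only the type is right, and a value term that decays with the absolute error once the key also matches. The function name `structured_reward`, the equal one-third weights, the exponential decay, and the example names `haze` and `beta` are all assumptions, not the paper's design.

```python
import math


def structured_reward(pred, target, value_scale=1.0):
    """Hierarchical reward in [0, 1] for a (type, key, value) prediction.

    Hypothetical scheme: each level of the hierarchy is worth 1/3, and a
    lower level earns credit only if every level above it is correct.
    """
    t_hat, k_hat, v_hat = pred
    t, k, v = target

    # A wrong degradation type makes the key and value meaningless.
    if t_hat != t:
        return 0.0
    reward = 1.0 / 3.0

    # A wrong parameter key means the value refers to a different quantity.
    if k_hat != k:
        return reward
    reward += 1.0 / 3.0

    # Value credit decays smoothly with the absolute error.
    reward += (1.0 / 3.0) * math.exp(-abs(v_hat - v) / value_scale)
    return reward


if __name__ == "__main__":
    target = ("haze", "beta", 1.0)
    print(structured_reward(("haze", "beta", 1.0), target))   # exact match
    print(structured_reward(("haze", "beta", 1.5), target))   # value slightly off
    print(structured_reward(("haze", "other", 1.0), target))  # wrong key
    print(structured_reward(("blur", "beta", 1.0), target))   # wrong type
```

Gating the lower levels on the upper ones keeps the reward from crediting a numerically close value that is attached to the wrong physical parameter.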

Theorems & Definitions (4)

  • Proposition 4.1: Equivalence of Objectives
  • Proposition 4.2: Excess Risk Bounds
  • proof
  • proof