Understanding Degradation with Vision Language Model
Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
TL;DR
This work reframes image degradation understanding as a hierarchical, parametric task and introduces DU-VLM, a Vision-Language Model that predicts a structured degradation description $(t,k,v)$ (degradation type, parameter key, and continuous value) through autoregressive next-token prediction. It unifies taxonomy, keys, and continuous values under a single objective, supported by theoretical error bounds and a new large-scale benchmark, DU-110k. The approach combines multimodal chain-of-thought reasoning, frequency and edge cues, and a diffusion-based restoration prior with offline and online reinforcement learning to enforce physical consistency and improve generalization. Empirical results show superior degradation understanding and zero-shot diffusion-based restoration across diverse degradations, with robust generalization to real-world data. By bridging semantic reasoning and physical image formation models, the method enables interpretable and controllable restoration in practical applications.
Abstract
Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task that requires jointly estimating degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, and that it generalizes to unseen distributions.
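
The claim that a continuous value can be predicted as a token with error bounded by the quantization grid can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the token format (`<type:...>`, `<key:...>`, `<val:...>`), the names `ValueGrid`, `serialize`, and `deserialize`, the example degradation `gaussian_blur` with key `sigma`, and the choice of a uniform grid. With a uniform grid of step $\Delta$ and round-to-nearest encoding, any in-range value is recovered with absolute error at most $\Delta/2$.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ValueGrid:
    """Hypothetical uniform quantization grid over [lo, hi] with `bins` points."""
    lo: float
    hi: float
    bins: int

    @property
    def step(self) -> float:
        # Spacing between adjacent grid points.
        return (self.hi - self.lo) / (self.bins - 1)

    def encode(self, v: float) -> int:
        # Clamp to the grid range, then round to the nearest grid index.
        v = min(max(v, self.lo), self.hi)
        return round((v - self.lo) / self.step)

    def decode(self, idx: int) -> float:
        # Map a grid index back to its continuous value.
        return self.lo + idx * self.step


def serialize(t: str, k: str, v: float, grid: ValueGrid) -> List[str]:
    """Turn a (type, key, value) triple into a flat token sequence (assumed format)."""
    return [f"<type:{t}>", f"<key:{k}>", f"<val:{grid.encode(v)}>"]


def deserialize(tokens: List[str], grid: ValueGrid) -> Tuple[str, str, float]:
    """Invert `serialize`; the value is recovered only up to the grid resolution."""
    t = tokens[0][len("<type:"):-1]
    k = tokens[1][len("<key:"):-1]
    idx = int(tokens[2][len("<val:"):-1])
    return t, k, grid.decode(idx)


if __name__ == "__main__":
    grid = ValueGrid(lo=0.0, hi=5.0, bins=101)  # step = 0.05
    tokens = serialize("gaussian_blur", "sigma", 1.234, grid)
    t, k, v_hat = deserialize(tokens, grid)
    print(tokens, v_hat)
    # For in-range values the round-trip error never exceeds half a grid step.
    assert abs(v_hat - 1.234) <= grid.step / 2 + 1e-12
```

In this sketch, a finer grid tightens the value error but enlarges the value vocabulary, which is the trade-off that a quantization-based error bound makes explicit.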
