Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation

Wei Dong, Han Zhou, Junwei Lin, Jun Chen

TL;DR

VAR-LIDE tackles joint low-light enhancement and deblurring of real-world images without paired supervision. It builds on a Visual Autoregressive (VAR) backbone conditioned on Vision-Language Model (VLM) priors, enabling illumination-adaptive and blur-aware generation. Its key components are a VLM-Informed Conditioning Module (VICM) for perception-driven illumination adjustment, a content-aware Spatial-Frequency RoPE (SF-RoPE) for preserving structure under blur, and a recursive phase modulation mechanism (VGPM) that refines the FFT phase under blur-score guidance. The model is trained with a reference-free objective that combines Adaptive Exposure, Structural Entropy, Structural Contrast, and Total Variation losses. The approach reports state-of-the-art performance on LOL-Blur and Real-LOLBlur, indicating good generalization to real-world scenes.

Abstract

Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges. Existing methods often rely on paired data or fail to model dynamic illumination and blur characteristics, leading to poor generalization. To tackle this, we propose a generative framework based on visual autoregressive (VAR) modeling, guided by perceptual priors from a vision-language model (VLM). Specifically, to supply informative conditioning cues for the VAR model, we deploy an adaptive curve estimation scheme that adjusts to diverse illumination based on VLM-derived visibility scores. In addition, we integrate dynamic, spatial-frequency-aware Rotary Positional Encodings (SF-RoPE) into VAR to enhance its ability to model structures degraded by blur. Furthermore, we propose a recursive phase-domain modulation strategy that mitigates blur-induced artifacts via bounded iterative refinement guided by VLM-assessed blur scores. Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.
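
To make the visibility-guided curve adjustment concrete, here is a minimal sketch and not the authors' implementation. It assumes a quadratic enhancement curve of the form x + a·x·(1 − x), as used in earlier zero-reference curve-estimation work, a visibility score in [0, 1], and a hypothetical linear mapping from that score to an iteration count. The names `enhance`, `alphas`, and `max_iters` are illustrative.

```python
def enhance(pixels, alphas, visibility, max_iters=8):
    """Apply enhancement curves iteratively, truncated by a visibility score.

    pixels:     intensities normalized to [0, 1]
    alphas:     one curve parameter per iteration, each in [-1, 1]
    visibility: assumed VLM-derived score in [0, 1] (1 = already well lit)
    """
    # Hypothetical mapping: the darker the image, the more iterations it gets.
    n_v = round((1.0 - visibility) * max_iters)
    n_v = max(0, min(n_v, len(alphas), max_iters))

    out = list(pixels)
    for alpha in alphas[:n_v]:
        # Quadratic curve (assumed form); it keeps values inside [0, 1].
        out = [x + alpha * x * (1.0 - x) for x in out]
    return out, n_v


dark, n_dark = enhance([0.1, 0.5], [0.8] * 8, visibility=0.0)
lit, n_lit = enhance([0.1, 0.5], [0.8] * 8, visibility=1.0)
print(n_dark, n_lit)
```

In this sketch a well-lit input (visibility near 1) passes through unchanged, while a dark input receives the full number of curve iterations.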

Paper Structure

This paper contains 30 sections, 12 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 2: The overall framework of our proposed VAR-LIDE method, which adopts the pre-trained VAR model [varsr] as the backbone. We first use the perceptual-prior extraction pipeline [gppllie] to obtain visibility-aware and blurriness-aware scores ($v$ and $b$). Then, $v$ is fed into our VLM-Informed Conditioning Module (VICM) to adaptively improve visibility and supply informative cues for VAR modeling. Moreover, to generate content-aware positional embeddings, we develop spatial-frequency rotary positional encodings (SF-RoPE) in the VAR transformer blocks. Finally, guided by the VLM blur assessment $b$, we introduce a recursive modulation mechanism (VGPM) in the FFT phase domain to further mitigate blurriness and achieve visually compelling outputs.
  • Figure 3: The overall framework of our VICM. It estimates illumination curves and adaptively truncates them based on a visibility-aware iteration count $n_v$.
  • Figure 5: Visual comparisons on the LOL-Blur dataset, which involves both severe low-light conditions and motion blur. Compared to existing methods, our approach better preserves fine details and improves perceptual quality across diverse scenes.
  • Figure 6: Our VGPM progressively refines the phase representation to mitigate ghosting artifacts introduced by blur.
  • Figure 7: Visual comparisons on the Real-LOLBlur dataset. Our method restores natural illumination and achieves superior deblurring performance with sharper edges and clearer structures, showing strong generalization to complex real-world scenes.
  • ...and 6 more figures
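
The phase refinement that the Figure 2 and Figure 6 captions attribute to VGPM can be illustrated with a small sketch. It is a simplified 1-D stand-in under stated assumptions, not the paper's method: the blur score $b$ is taken to lie in [0, 1], the iteration count scales linearly with it, each phase update is bounded with tanh, and the amplitude is left untouched. `delta_fn` is a hypothetical placeholder for whatever learned phase correction the model predicts. A practical version would use a 2-D FFT from an array library rather than this naive DFT.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (for illustration only)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def modulate_phase(signal, blur_score, delta_fn, max_iters=4, bound=0.1):
    """Refine only the FFT phase, for a number of steps set by the blur score."""
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]

    # Hypothetical mapping: blurrier inputs get more refinement steps.
    n_b = max(0, min(max_iters, round(blur_score * max_iters)))
    for _ in range(n_b):
        # tanh keeps each per-step phase change within (-bound, bound).
        phase = [p + bound * math.tanh(delta_fn(k, p))
                 for k, p in enumerate(phase)]

    # Recombine the untouched amplitude with the refined phase.
    refined = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return refined, n_b


signal = [0.0, 1.0, 0.5, 0.25]
refined, steps = modulate_phase(signal, blur_score=1.0,
                                delta_fn=lambda k, p: 1.0)
restored = [c.real for c in idft(refined)]
```

Because each step is bounded and the step count is capped, the total phase change per frequency stays below `max_iters * bound`, which is one simple way to read "bounded iterative refinement". Taking the real part after the inverse transform is needed here because an arbitrary phase update does not preserve conjugate symmetry.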