CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun

Abstract

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
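The second step above (and the Figure 3 caption later in this section) describes replacing the decode-reencode detour with a direct, optimizable connection between the generated VAE latent and the reasoning context. Below is a minimal PyTorch-style sketch of that idea. The class name LatentBridge, the patchification scheme, and all dimensions are illustrative assumptions, not the paper's implementation; the point is only that the latent is projected and concatenated directly into the reasoning context instead of being decoded to pixels and re-encoded through a ViT.

```python
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    """Hypothetical sketch: map a generated VAE latent straight into the
    language model's embedding space, bypassing decode-reencode."""

    def __init__(self, latent_channels=16, patch=2, hidden_dim=3584):
        super().__init__()
        self.patch = patch
        # Flatten each patch x patch latent block into one "visual token".
        self.proj = nn.Linear(latent_channels * patch * patch, hidden_dim)

    def forward(self, z):  # z: (B, C, H, W) generated VAE latent
        B, C, H, W = z.shape
        p = self.patch
        tokens = (z.reshape(B, C, H // p, p, W // p, p)
                   .permute(0, 2, 4, 1, 3, 5)          # (B, H/p, W/p, C, p, p)
                   .reshape(B, (H // p) * (W // p), C * p * p))
        return self.proj(tokens)                        # (B, N, hidden_dim)

# Standard unified models take the detour (gradients pass poorly):
#   pixels = vae.decode(z); vis_feats = vit(pixels)
# Bridge-style injection (assumed usage): concatenate latent tokens
# directly into the reasoning context so answer-correctness gradients
# can flow back into generation.
#   context = torch.cat([text_embeds, LatentBridge()(z)], dim=1)
```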

Paper Structure

This paper contains 41 sections, 17 equations, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Top: average scores of commercial and open-source multimodal models on clean versus degraded inputs from MMD-Bench across six benchmarks. All models show substantial performance drops under degradation. Bottom: comparison between existing multimodal models and CLEAR on a degraded image.
  • Figure 2: Overview of CLEAR. Stage 1 (top) performs supervised fine-tuning to establish the generate-then-answer reasoning pattern and warm-start the Latent Representation Bridge, with both VAE latent and ViT re-encoded features injected during this stage. Stage 2 (bottom) applies Interleaved GRPO, where text tokens are optimized with GRPO and the denoising steps with Flow-GRPO, sharing the same group-relative advantage from answer-correctness rewards (a minimal sketch of this shared-advantage update appears after the figure list). The ViT path is removed in Stage 2, making the bridge the sole connection between generation and reasoning.
  • Figure 3: Left: the standard decode-reencode path in existing unified models. The generated VAE latent must be decoded into pixels and re-encoded through the ViT before it can enter the reasoning context. Right: the Latent Representation Bridge in CLEAR. The generated VAE latent is directly concatenated into the reasoning context, eliminating the decode-reencode bottleneck and providing an effective optimization route from answer correctness back to generation.
  • Figure 4: Qualitative examples of CLEAR's adaptive reasoning. Left: on a mildly noisy image, the model skips generation and answers directly. Right: on a severely blurred image, the model triggers generation to recover obscured details before answering.
  • Figure 5: Generation triggering rate (bars, left axis) and total inference time (line, right axis) across degradation severity levels for each benchmark.
  • ...and 8 more figures
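As referenced in the Figure 2 caption, Interleaved GRPO applies a single group-relative advantage, computed from answer-correctness rewards, to both the text-token update (GRPO) and the denoising-step update (Flow-GRPO). The sketch below illustrates that shared-advantage structure under stated assumptions: the function names, the clipped policy-ratio form, the per-rollout scalar log-probabilities, and the weighting term lam are illustrative, not the paper's exact objective.

```python
import torch

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize the answer-correctness rewards of
    G rollouts sampled for the same degraded-image question."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def clipped_pg_loss(logp, logp_old, adv, clip_eps=0.2):
    """Standard clipped policy-ratio loss; one scalar log-prob per rollout."""
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

def interleaved_grpo_loss(text_logp, text_logp_old,
                          denoise_logp, denoise_logp_old,
                          rewards, lam=1.0):
    """Joint objective sketch: the SAME group-relative advantage drives
    both the text update (GRPO) and the denoising-step update
    (Flow-GRPO-style). `lam` balancing the two terms is an assumption."""
    adv = group_relative_advantage(rewards)
    return (clipped_pg_loss(text_logp, text_logp_old, adv)
            + lam * clipped_pg_loss(denoise_logp, denoise_logp_old, adv))

# Example: four rollouts for one prompt, binary answer-correctness rewards.
# loss = interleaved_grpo_loss(text_lp, text_lp_old, flow_lp, flow_lp_old,
#                              rewards=[1.0, 0.0, 1.0, 0.0])
```

Because both terms share one advantage vector, a rollout whose final answer is correct reinforces its visual generation steps as well as its text reasoning, which is the joint-optimization behavior the abstract attributes to Interleaved GRPO.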