SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback

Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, Yun Fu

TL;DR

This work introduces SimpleCall, a lightweight image restoration agent trained in a label-free setting using perceptual feedback from multimodal LLMs (MLLMs). It formulates restoration as a sequential decision process over a discrete tool library and optimizes an actor–critic policy with PPO-style clipping, guided by DeQA-Score perceptual rewards. The method achieves competitive full-reference performance without supervision and surpasses baselines on no-reference metrics, while offering constant, one-pass inference across degradation settings. The results demonstrate robust generalization to unseen degradation mixtures and highlight perception-based supervision as a scalable approach for autonomous restoration.
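To make the "PPO-style clipping" mentioned above concrete, the following is a minimal sketch of the standard PPO clipped surrogate objective. It is not taken from the paper: the function name `ppo_clipped_objective`, the default `clip_eps=0.2`, and the plain-list inputs are assumptions chosen for illustration, and the paper's actual loss may include additional terms (for example a value loss or an entropy bonus).

```python
import math

def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates, e.g. reward minus a critic's value.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping keeps a single batch of perceptual rewards from pushing the policy far from the one that collected the data: once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors, the objective stops rewarding further movement.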

Abstract

Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plan without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.
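The sequential decision process described in the abstract can be sketched as a short rollout loop: at each step a policy scores the tools, one tool is chosen (sampled during training, highest-probability at inference), and the tool is applied. Everything below is a toy illustration under stated assumptions, not the paper's implementation: the tool names, the fixed horizon `max_steps`, the tuple-of-degradations "image", and the helpers `toy_policy` and `toy_apply` are all hypothetical, and the MLLM reward is omitted.

```python
import math
import random

TOOLS = ["deblur", "denoise", "derain"]  # hypothetical tool library
FIXES = {"deblur": "blur", "denoise": "noise", "derain": "rain"}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def select_action(logits, greedy, rng):
    """Greedy argmax (inference) or sampling from the policy (training)."""
    probs = softmax(logits)
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def rollout(state, policy, apply_tool, max_steps=2, greedy=True, seed=0):
    """Run one tool-calling episode and return (final_state, plan)."""
    rng = random.Random(seed)
    plan = []
    for step in range(max_steps):
        action = select_action(policy(state, step), greedy, rng)
        state = apply_tool(TOOLS[action], state)
        plan.append(TOOLS[action])
    return state, plan

# Toy stand-ins: the "image" is just a tuple of remaining degradations.
def toy_policy(state, step):
    return [3.0 if FIXES[t] in state else 0.0 for t in TOOLS]

def toy_apply(tool, state):
    return tuple(d for d in state if d != FIXES[tool])

final_state, plan = rollout(("noise", "rain"), toy_policy, toy_apply)
print(plan, final_state)
```

With `greedy=True` the plan is deterministic for a given input, which mirrors the one-pass inference claimed in the abstract; with `greedy=False` the same loop produces the exploratory trajectories that a perceptual reward would score during training.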

Paper Structure

This paper contains 22 sections, 11 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (a) Existing restoration agents [RestoreAgent, agenticir, hybridagent] typically consist of assessment, scheduling, execution, reflection, and rollback, using VLMs for degradation recognition and LLMs for plan making; (b) Our SimpleCall agent determines the tool-calling sequence via a single policy execution, avoids the need for iterative trial-and-error, and generalizes to label-free environments.
  • Figure 2: Framework overview. The restoration agent predicts the next action based on the current input status (sampling actions during training while selecting the highest-probability action during inference, see Sec. \ref{ira_}). The environment executes the chosen action, evaluates the restored output with an MLLM, and returns a feedback signal to update the agent's policy (see Sec. \ref{LF_ENV}). Through this iterative interaction, the agent progressively refines its decision-making policy without ground-truth supervision.
  • Figure 3: Qualitative comparison between our method and SOTA restoration baselines (for other baselines see the supplementary material).
  • Figure 4: Runtime comparison between our method and AgenticIR [agenticir].
  • Figure 5: Illustration of tool effects. Left: images with dark degradation (tiger: motion blur+dark; panda: dark+noise; skyscraper: rain+dark). Right: outputs from the dehazing model RIDCP [RIDCP].
  • ...and 3 more figures