RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models

Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng

TL;DR

RetouchLLM tackles the need for flexible, interpretable image retouching without reliance on large paired datasets. It introduces a training-free, white-box pipeline that iteratively refines high-resolution images using a visual critic (VLM) and a code generator (LLM) to produce executable editing programs. A CLIP-based, KL-divergence style selection score guides each iteration, enabling stable convergence toward a target style with multiple reference images and no training data. The system supports natural language interaction for personalized, fine-grained edits and demonstrates strong generalization across diverse retouching styles and backbones. This approach offers transparent, reusable editing pipelines suitable for real-world applications while preserving image fidelity and enabling user-guided refinement.
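One way to picture such a selection score is sketched below. The source says only that the score is CLIP-based and uses a KL divergence, so the rest is an assumption made for illustration: each image is represented by a list of image-to-prompt similarity logits (for example against prompts such as dark/bright), the logits are turned into a distribution with a softmax, and a candidate is scored by its divergence from the mean reference distribution. The names `softmax`, `kl_divergence`, and `selection_score` are hypothetical, and the logits are assumed to come from an external CLIP model that is not shown.

```python
import math


def softmax(logits):
    """Turn raw similarity logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def selection_score(candidate_logits, reference_logits_list):
    """Assumed score: lower means the candidate is closer to the reference style.

    candidate_logits: similarity logits of one candidate image against K prompts.
    reference_logits_list: one list of K logits per reference image.
    """
    ref_dists = [softmax(logits) for logits in reference_logits_list]
    num_prompts = len(ref_dists[0])
    # Average the reference distributions prompt by prompt.
    mean_ref = [
        sum(dist[i] for dist in ref_dists) / len(ref_dists)
        for i in range(num_prompts)
    ]
    return kl_divergence(mean_ref, softmax(candidate_logits))
```

Under these assumptions, the candidate with the smallest score would be carried into the next iteration.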

Abstract

Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable code. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
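The progressive, two-module process described in the abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `critic`, `generator`, and `score` are hypothetical callables standing in for the visual critic, the code generator, and the selection score, and the stopping rule (stop when no candidate improves the score, or after `max_iters` rounds) is an assumption because the source does not specify the criterion.

```python
def retouch_loop(source, critic, generator, score, max_iters=10, num_candidates=3):
    """Iteratively retouch `source`, keeping the best candidate each round.

    critic(image)      -> list of textual difference descriptions (hypothetical)
    generator(text)    -> callable program(image) -> image (hypothetical)
    score(image)       -> float, lower means closer to the target style (assumed)
    Returns the final image and the list of programs that were applied.
    """
    current = source
    best_score = score(current)
    applied_programs = []

    for _ in range(max_iters):
        candidates = []
        for description in critic(current)[:num_candidates]:
            program = generator(description)   # executable adjustment
            result = program(current)          # apply it to the current image
            candidates.append((score(result), program, result))

        if not candidates:
            break

        cand_score, program, result = min(candidates, key=lambda c: c[0])
        if cand_score >= best_score:
            break  # assumed stopping criterion: no candidate improves the score

        # The best candidate becomes the new source for the next round.
        current, best_score = result, cand_score
        applied_programs.append(program)

    return current, applied_programs
```

Because the loop returns the applied programs as well as the final image, the sequence of adjustments stays inspectable and can be replayed on other images, which is the sense in which such a pipeline is white-box.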

Paper Structure

This paper contains 42 sections, 12 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: Overview of our training-free white-box photo adjustment system. Given a source image and style reference images, the visual critic proposes multiple candidate difference descriptions, and the code generator produces the corresponding adjustment programs. The best candidate is selected according to the selection score and set as the new source, and the process repeats until the stopping criterion is reached. The dashed box (GT Adjusted Image) is shown for reference only and lies outside the pipeline. Only the dark/bright prompts are shown for brevity, though eight prompts were used.
  • Figure 2: Quantitative results over 10 iterations. All metrics show consistent improvement over iterations. Higher PSNR and SSIM, and lower LPIPS and $\Delta$E, indicate closer similarity to the GT.
  • Figure 3: Qualitative results of progressively retouched images. In each row, the leftmost image is the source, the rightmost is the GT, and the middle images show the progressively retouched results.
  • Figure 4: Applying the restored filter. The paired setup enables extracting a more faithful and reusable retouching code that can be applied to other images like a preset filter.
  • Figure 5: User interactive retouching. The user gives instructions to retouch images towards the desired style. These retouched images can then be fed back into the pipeline for further retouching.
  • ...and 12 more figures
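
Of the metrics named in the Figure 2 caption, PSNR has the simplest closed form, and a minimal sketch of it is given here to make "higher is closer to the GT" concrete. The function name `psnr`, the flat-list image representation, and the default 8-bit peak value of 255 are assumptions for illustration; the paper's evaluation code is not shown in the source.

```python
import math


def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    Both inputs must have the same non-zero length. Identical images
    give infinity, since the mean squared error is zero.
    """
    if not image_a or len(image_a) != len(image_b):
        raise ValueError("images must be non-empty and of equal length")

    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A retouched image that moves closer to the GT has a smaller mean squared error and therefore a higher PSNR, which matches the trend the caption describes.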