RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models
Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng
TL;DR
RetouchLLM tackles the need for flexible, interpretable image retouching without reliance on large paired datasets. It introduces a training-free, white-box pipeline that iteratively refines high-resolution images using a visual critic (VLM) and a code generator (LLM) to produce executable editing programs. A CLIP-based, KL-divergence style selection score guides each iteration, enabling stable convergence toward a target style with multiple reference images and no training data. The system supports natural language interaction for personalized, fine-grained edits and demonstrates strong generalization across diverse retouching styles and backbones. This approach offers transparent, reusable editing pipelines suitable for real-world applications while preserving image fidelity and enabling user-guided refinement.
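As a rough illustration of how a KL-divergence selection score might rank candidate edits, the sketch below compares each candidate against the averaged reference distribution and keeps the closest one. The exact formulation of the score is not given in this summary, so everything here is an assumption: the feature vectors stand in for CLIP-derived features (no CLIP model is loaded), and `softmax`, `kl_divergence`, `selection_score`, and `select_best` are hypothetical helper names.

```python
import math

def softmax(values, temperature=1.0):
    """Turn a vector of raw scores into a probability distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def selection_score(candidate_feat, reference_feats):
    """Lower is better: divergence between the mean reference distribution
    and the candidate distribution. ASSUMPTION: features are plain vectors
    standing in for CLIP-based features."""
    ref_dists = [softmax(f) for f in reference_feats]
    mean_ref = [sum(col) / len(ref_dists) for col in zip(*ref_dists)]
    return kl_divergence(mean_ref, softmax(candidate_feat))

def select_best(candidate_feats, reference_feats):
    """Return the index of the lowest-scoring candidate and all scores."""
    scores = [selection_score(c, reference_feats) for c in candidate_feats]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy usage: two reference feature vectors, two candidate edits.
references = [[2.0, 0.5, 0.1], [1.8, 0.6, 0.2]]
candidates = [[0.1, 0.5, 2.0], [1.9, 0.55, 0.15]]
best_index, all_scores = select_best(candidates, references)
```

Averaging the reference distributions before computing the divergence is one simple way to use multiple reference images; other aggregation choices are equally plausible.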
Abstract
Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, which makes the retouching process opaque and limits their ability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free, white-box image retouching system that requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable code. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
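The critic-and-generator loop described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the authors' implementation: an "image" is a flat list of brightness values in [0, 1], `visual_critic` and `code_generator` are hypothetical stand-ins for the VLM and LLM calls, and `style_distance` is a placeholder for whatever selection score the system actually uses.

```python
def mean(values):
    return sum(values) / len(values)

def visual_critic(image, references):
    """Stand-in for the VLM critic: report how the image differs from the references."""
    target = mean([mean(r) for r in references])
    return {"brightness_gap": target - mean(image)}

def code_generator(critique, n_candidates=3):
    """Stand-in for the LLM generator: emit candidate editing programs as source text."""
    gap = critique["brightness_gap"]
    programs = []
    for k in range(1, n_candidates + 1):
        step = gap * k / n_candidates  # candidates explore different adjustment strengths
        programs.append(
            "def edit(img):\n"
            f"    return [min(1.0, max(0.0, p + {step!r})) for p in img]\n"
        )
    return programs

def run_program(source, image):
    """Execute a generated program. A real system should sandbox this step."""
    namespace = {}
    exec(source, namespace)
    return namespace["edit"](image)

def style_distance(image, references):
    """Placeholder selection score (lower is better); an assumption, not the paper's score."""
    return abs(mean([mean(r) for r in references]) - mean(image))

def retouch(image, references, n_iters=3):
    """Iteratively critique, generate candidates, and keep the best-scoring edit."""
    history = []  # the accepted programs form a readable, reusable editing pipeline
    for _ in range(n_iters):
        critique = visual_critic(image, references)
        candidates = code_generator(critique)
        scored = [(style_distance(run_program(src, image), references), src)
                  for src in candidates]
        best_score, best_src = min(scored)
        if best_score >= style_distance(image, references):
            break  # no candidate improves on the current image
        image = run_program(best_src, image)
        history.append(best_src)
    return image, history

# Toy usage: brighten a dark input toward two brighter references.
source_image = [0.2, 0.3, 0.4]
reference_images = [[0.6, 0.7, 0.8], [0.5, 0.6, 0.7]]
result, programs = retouch(source_image, reference_images)
```

Because each accepted step is kept as source text in `programs`, the sequence can be inspected or replayed on another image, which is the sense in which a code-based pipeline is white-box.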
