PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching

Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu, Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li

TL;DR

PerTouch tackles personalized, semantically aware image retouching by combining diffusion priors with region-level attribute control. It introduces two training-time mechanisms (semantic replacement and parameter perturbation), a VLM-driven agent with feedback-driven rethinking, and a scene-aware memory that captures long-term preferences. The approach produces region-specific edits while preserving global aesthetics, demonstrated on MIT-Adobe FiveK, with ablations validating each component. The code is publicly available, which supports reproducibility.

Abstract

Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component's effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.
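
As a rough illustration of the parameter maps described in the abstract, the sketch below paints one attribute value per semantic region into a dense map and then jitters it. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the [-1, 1] value range, and the Gaussian noise standing in for parameter perturbation are all hypothetical, and the masks are hand-made rather than produced by a segmentation model.

```python
import numpy as np


def build_parameter_map(masks, region_values, shape, default=0.0):
    """Paint one attribute value per semantic region into a dense 2-D map.

    masks: list of boolean arrays of the given shape, one per region.
    region_values: one attribute value (e.g. a brightness offset) per mask.
    Later masks overwrite earlier ones where regions overlap.
    """
    param_map = np.full(shape, default, dtype=np.float32)
    for mask, value in zip(masks, region_values):
        param_map[mask] = value
    return param_map


def perturb_parameter_map(param_map, rng, sigma=0.05, low=-1.0, high=1.0):
    """Add small Gaussian noise and clip to the assumed value range.

    Assumption: Gaussian jitter is only a stand-in for the paper's
    perturbation mechanism, whose exact form is not given in this summary.
    """
    noise = rng.normal(0.0, sigma, size=param_map.shape).astype(np.float32)
    return np.clip(param_map + noise, low, high)


# Toy example: a 4x6 image whose top half is "sky" and bottom half is "ground".
h, w = 4, 6
sky = np.zeros((h, w), dtype=bool)
sky[:2, :] = True
ground = ~sky

# Hypothetical request: brighten the sky, slightly darken the ground.
brightness = build_parameter_map([sky, ground], [0.5, -0.2], (h, w))
perturbed = perturb_parameter_map(brightness, np.random.default_rng(0))
```

In a setup like the one the abstract describes, one such map per attribute would be passed to the retouching model as the control signal alongside the input image.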

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of our PerTouch pipeline. Our method supports region-level personalized retouching with long-term user memory. Given an image and a natural language instruction, PerTouch determines the strength of the instruction, and then leverages the scene-aware memory to adaptively perform corresponding retouching operations based on the user's historical preferences. The final result is retouched by a fine-tuned diffusion model, ensuring globally pleasing and finely controlled region-level edits. The examples at the bottom demonstrate the system's ability to perform both global retouching and fine-grained regional adjustment across various instruction types.
  • Figure 2: Dataset construction and training pipeline of PerTouch. To enable region-level controllable retouching, we construct training samples by generating parameter maps that transform low-quality input images into expert-retouched ground truth results. Specifically, we (1) extract semantic masks using SAM and estimate corresponding attribute parameters for each region; (2) introduce the Semantic Replacement Module to help the model perceive semantic regions by constructing diverse yet semantically consistent samples; and (3) apply the Perturbation Mechanism to prevent overfitting to segmentation boundaries and improve overall visual quality. The final parameter maps are injected into ControlNet alongside the original images, enabling the model to balance the global aesthetic consistency provided by diffusion priors and the regional guidance from parameter maps, thereby producing high-quality region-aware retouching outputs.
  • Figure 3: Agent workflow in PerTouch. Our unified agent framework adaptively parses user instructions of varying strength. For weak instructions (e.g., “Optimize this image.”), the agent leverages scene-aware memory to retrieve long-term user preferences and generates editable parameter maps based on historical behavior. For strong instructions (e.g., “Significantly increased eagle brightness.”), the agent further adopts a feedback-driven rethinking mechanism to iteratively refine vague or unsatisfactory outputs. This adaptive instruction-following pipeline allows PerTouch to support both global and region-level personalized retouching under natural language commands.
  • Figure 4: Qualitative comparison with other methods.
  • Figure 5: Comparison with JarvisArt.
  • ...and 2 more figures
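
The agent workflow in the Figure 3 caption can be sketched as a small routing function: weak instructions fall back to scene-aware memory, while strong instructions go through a render, check, and adjust loop. This is a minimal sketch, not the authors' code. Every name here (`SceneMemory`, `rethink`, `run_agent`, the `render`, `satisfied`, and `adjust` callbacks) is hypothetical, instruction strength is passed in as a flag rather than inferred by a VLM, and writing accepted parameters back to memory only after strong instructions is an assumption.

```python
class SceneMemory:
    """Hypothetical scene-aware memory: accepted parameters grouped by scene."""

    def __init__(self):
        self._history = {}  # scene label -> list of accepted parameter dicts

    def remember(self, scene, params):
        self._history.setdefault(scene, []).append(dict(params))

    def recall(self, scene):
        """Return the per-attribute mean of past parameters for this scene."""
        totals, counts = {}, {}
        for record in self._history.get(scene, []):
            for attr, value in record.items():
                totals[attr] = totals.get(attr, 0.0) + value
                counts[attr] = counts.get(attr, 0) + 1
        return {attr: totals[attr] / counts[attr] for attr in totals}


def rethink(params, render, satisfied, adjust, max_rounds=3):
    """Render, check the result, and adjust parameters until accepted."""
    for _ in range(max_rounds):
        result = render(params)
        if satisfied(result):
            return params, result
        params = adjust(params, result)
    return params, render(params)


def run_agent(is_strong, scene, requested, memory, render, satisfied, adjust):
    """Route an instruction by strength (the flag stands in for a VLM's judgment)."""
    if is_strong:
        params, result = rethink(requested, render, satisfied, adjust)
        memory.remember(scene, params)  # assumption: store accepted edits
    else:
        params = memory.recall(scene)
        result = render(params)
    return params, result


# Toy stand-ins: the "image" is just its brightness value.
def render(params):
    return params.get("brightness", 0.0)


def satisfied(result):
    return result >= 0.6


def adjust(params, result):
    return {**params, "brightness": params.get("brightness", 0.0) + 0.2}


memory = SceneMemory()
strong_params, _ = run_agent(True, "wildlife", {"brightness": 0.3},
                             memory, render, satisfied, adjust)
weak_params, _ = run_agent(False, "wildlife", {}, memory,
                           render, satisfied, adjust)
```

In this toy run the strong instruction is refined over several rounds until the check passes, and the later weak instruction on the same scene reuses the stored preference instead of starting from scratch.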