PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue

TL;DR

This work presents PhotoAgent, a system that advances image editing through explicit aesthetic planning, and introduces UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model to support reliable evaluation in real-world scenarios.

Abstract

With the rapid development of generative models, instruction-based image editing has shown great potential for generating high-quality images. However, editing quality depends heavily on carefully designed instructions, which places the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set of 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.

Paper Structure (27 sections, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: PhotoAgent autonomously performs high-level, semantically meaningful edits aligned with human aesthetics, moving beyond low-level color, contrast, or illumination tweaks. Upper-Left: People-loop, where users iteratively inspect the image, propose edits, and apply changes until satisfied. Upper-Right: PhotoAgent, where the process runs autonomously. Bottom: Edited photos. Note that PhotoAgent also supports user-guided editing (Fig. \ref{fig:user_prompt}).
  • Figure 2: Detailed loop of PhotoAgent. First, the Perceiver extracts semantic cues from the current image and proposes $N$ candidate editing actions. Second, the Planner explores the candidate actions through iterative rollouts, scoring, and pruning to progressively refine edits and select the action that achieves the optimal result. Third, the Executor applies these edits while the Evaluator scores intermediate results, invoking re-planning when the score is unsatisfactory.
  • Figure 3: Pipeline for constructing the UGC-Edit dataset and training the reward model. We start with a diverse pool of source images from LAION \cite{schuhmann2022laion} and RealQA \cite{li2025next}. Each image is processed through a structured prompt with Qwen3-VL \cite{wu2025qwenimagetechnicalreport} for UGC classification. The images are then filtered by human annotators. Finally, a reward model is trained via GRPO \cite{shao2024deepseekmath} to predict fine-grained quality scores.
  • Figure 4: Qualitative results. PhotoAgent generates visually pleasing edits by autonomously improving color harmony, composition, and aesthetic expressiveness, often introducing a stronger sense of visual dynamics and atmosphere. Baseline methods tend to produce incomplete or less coherent outputs.
  • Figure 5: The editing process of our PhotoAgent over three iterations.
  • ...and 4 more figures
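
The perceive, plan, execute, and evaluate loop described in the Figure 2 caption can be illustrated with a small toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the image is reduced to a hypothetical `(brightness, saturation)` pair, `score` is a made-up stand-in for the learned aesthetic reward, and the names `propose`, `rollout`, `plan`, and `edit_loop` are invented. The planner is a simple two-round rollout search with pruning, which may differ from the tree search used in the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Toy stand-in for an image: (brightness, saturation). Purely hypothetical.
Image = Tuple[float, float]


@dataclass(frozen=True)
class Action:
    name: str
    delta: Tuple[float, float]


# Hypothetical action pool; a real system would emit editing instructions.
POOL = [
    Action("brighten", (0.1, 0.0)),
    Action("darken", (-0.1, 0.0)),
    Action("saturate", (0.0, 0.1)),
    Action("desaturate", (0.0, -0.1)),
]


def apply(image: Image, action: Action) -> Image:
    """Executor stand-in: apply one edit to the toy image."""
    return (image[0] + action.delta[0], image[1] + action.delta[1])


def score(image: Image) -> float:
    """Evaluator stand-in: made-up reward that peaks at (0.6, 0.5)."""
    return 1.0 - abs(image[0] - 0.6) - abs(image[1] - 0.5)


def propose(image: Image, n: int) -> List[Action]:
    """Perceiver stand-in: return up to n candidate actions."""
    return POOL[:n]


def rollout(image: Image, depth: int, rng: random.Random) -> float:
    """Apply `depth` random edits and return the best score seen."""
    best = score(image)
    for _ in range(depth):
        image = apply(image, rng.choice(POOL))
        best = max(best, score(image))
    return best


def plan(image: Image, n: int, keep: int, rng: random.Random) -> Action:
    """Planner stand-in: cheap rollouts, prune to `keep`, then re-score."""
    def value(action: Action, rollouts: int) -> float:
        nxt = apply(image, action)
        return sum(rollout(nxt, 3, rng) for _ in range(rollouts)) / rollouts

    candidates = propose(image, n)
    coarse = {a: value(a, 4) for a in candidates}
    survivors = sorted(candidates, key=lambda a: coarse[a], reverse=True)[:keep]
    fine = {a: value(a, 16) for a in survivors}
    return max(survivors, key=lambda a: fine[a])


def edit_loop(image: Image, max_steps: int = 10, threshold: float = 0.95,
              seed: int = 0) -> Tuple[Image, List[str]]:
    """Closed loop: plan, execute, evaluate, and re-plan on a bad result."""
    rng = random.Random(seed)
    memory: List[str] = []
    for _ in range(max_steps):
        if score(image) >= threshold:
            break
        action = plan(image, n=4, keep=2, rng=rng)
        candidate = apply(image, action)
        if score(candidate) > score(image):
            image = candidate
            memory.append(action.name)
        else:
            # Unsatisfactory edit: keep the old image and plan again.
            memory.append("rejected:" + action.name)
    return image, memory
```

Calling `edit_loop((0.2, 0.3))` returns the final toy image together with the memory of accepted and rejected actions. Because an edit is kept only when it raises the score, the final score never falls below the starting score in this sketch.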