PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data

Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu

TL;DR

PhotoDoodle presents a diffusion-transformer-based framework for artistic photo editing that learns artist-specific doodling from minimal paired data. A two-stage process—OmniEditor pretraining on large editing corpora and EditLoRA fine-tuning on 30-50 examples—coupled with Position Encoding Cloning and noise-free conditioning achieves precise, background-consistent edits. The approach is validated on general and customized tasks, outperforming baselines in both qualitative and quantitative evaluations, and a new six-style PhotoDoodle dataset is released to support future work. This framework enables scalable, instruction-driven artistic editing with strong preservation of background content. The work advances practical tools for personalized photo augmentation and provides a benchmark for reproducible research in stylized image editing.

Abstract

We introduce PhotoDoodle, a novel image editing framework designed to facilitate photo doodling by enabling artists to overlay decorative elements onto photographs. Photo doodling is challenging because the inserted elements must appear seamlessly integrated with the background, requiring realistic blending, perspective alignment, and contextual coherence. Additionally, the background must be preserved without distortion, and the artist's unique style must be captured efficiently from limited training data. These requirements are not addressed by previous methods that primarily focus on global style transfer or regional inpainting. The proposed method, PhotoDoodle, employs a two-stage training strategy. Initially, we train a general-purpose image editing model, OmniEditor, using large-scale data. Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques. To enhance consistency in the generated results, we introduce a positional encoding reuse mechanism. Additionally, we release a PhotoDoodle dataset featuring six high-quality styles. Extensive experiments demonstrate the advanced performance and robustness of our method in customized image editing, opening new possibilities for artistic creation.

Paper Structure

This paper contains 20 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: PhotoDoodle can mimic the styles and techniques of human artists in creating photo doodles, adding decorative elements to photos while maintaining perfect consistency between the pre- and post-edit states.
  • Figure 2: The overall architecture and training paradigm of PhotoDoodle. OmniEditor and EditLoRA both follow the LoRA training paradigm. We use a high-rank LoRA for pre-training OmniEditor on a large-scale dataset for general-purpose editing and text-following capabilities, and a low-rank LoRA for fine-tuning EditLoRA on a small set of paired stylized images to capture individual artists' specific styles and strategies for efficient customization. We encode the source image into condition tokens and concatenate them with the noised latent tokens, controlling the generation outcome through MMAttention.
  • Figure 3: The generated results of PhotoDoodle. PhotoDoodle can mimic the manner and style of artists creating photo doodles, enabling instruction-driven high-quality image editing.
  • Figure 4: Compared to baselines, PhotoDoodle demonstrates superior instruction following, image consistency, and editing effectiveness.
  • Figure 5: Ablation study results.
  • ...and 2 more figures
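
The conditioning scheme in the Figure 2 caption, combined with the positional encoding reuse mentioned in the abstract, can be illustrated with a small sketch. It is not the authors' implementation. It only shows one plausible reading: the clean (noise-free) source-image tokens are appended to the noised latent tokens, and the condition tokens reuse the same grid positions as the latent tokens they correspond to. The function names, the token order, and the (row, column) position format are all assumptions made for illustration.

```python
import random


def grid_position_ids(height, width):
    """Return one (row, col) position per token of a height x width latent grid."""
    return [(r, c) for r in range(height) for c in range(width)]


def build_sequence(noised_latent_tokens, condition_tokens, height, width):
    """Concatenate noised latent tokens with clean condition tokens.

    Hypothetical helper: the condition tokens get a copy ("clone") of the
    latent tokens' position ids instead of fresh, shifted positions.
    """
    n = height * width
    if len(noised_latent_tokens) != n or len(condition_tokens) != n:
        raise ValueError("both token lists must contain height * width tokens")

    pos = grid_position_ids(height, width)
    # Assumed order: noised latent tokens first, then condition tokens.
    tokens = list(noised_latent_tokens) + list(condition_tokens)
    # Cloned positions: the i-th condition token shares the i-th latent position.
    position_ids = pos + list(pos)
    return tokens, position_ids


# Toy example: a 4 x 4 latent grid with 8-dimensional tokens.
random.seed(0)
h, w, d = 4, 4, 8
clean_source = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h * w)]
noised_target = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h * w)]

tokens, position_ids = build_sequence(noised_target, clean_source, h, w)
```

In this sketch the condition half of the sequence is passed through without any added noise, and a joint attention layer over `tokens` would see each source token at the same position as the latent token it should align with. How the actual model encodes and consumes these positions is not specified in this section.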