Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, Anh Totti Nguyen

TL;DR

This work introduces PSR, the largest dataset of real-world image-editing requests paired with human edits, annotated with WordNet-based subjects, editing actions, and creativity levels to reveal real-user needs. Through a large-scale comparison of multiple AI editors against human edits and evaluation by vision-language models, the study finds that humans still outperform AI in typical requests, while VLM-based judgments often diverge from human judgments. AI edits tend to improve aesthetics and can handle about one-third of requests, with performance varying by action type and creativity. The results highlight critical gaps in identity preservation and instruction adherence, offering concrete directions for improving AI editors and evaluation methodologies in practical image editing scenarios.

Abstract

Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, only about 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make touch-ups that were not requested. On the evaluation side, VLM judges (e.g., o1) behave differently from human judges and may prefer AI edits over human edits. Code and qualitative examples are available at: https://psrdataset.github.io

Paper Structure

This paper contains 30 sections, 58 figures, 22 tables.

Figures (58)

  • Figure 1: We propose PSR, the largest dataset of real-world image-editing requests and human-made edits. PSR enables the community (and our work) to identify types of requests that can be automated using existing AIs and those that need improvement. PSR is the first dataset to tag all requests with WordNet subjects, real-world editing actions, and creativity levels.
  • Figure 2: Example cases from the PSR dataset where PSR-wizard edits were preferred by human raters over the AI edits (a-c) and cases where AI edits were preferred (d-f). (a): The human edit completes the request, but the AI edit removes the people, which was not requested. (b): The human edit completes the request, but the AI edit generates a similar image with people resembling those in the source image, although with different identities. (c): The human edit completes the request, but the AI edit does not because the finger does not go through the nose and out the eye. (d): Both edits make the requested removals, but the AI edit makes the image sharper and adds a house in the background (which was not requested). (e): The human edit reduces the color cast but leaves behind a bluish tint and muted tones, while the AI edit successfully restores realistic colors and contrast. (f): The human edit removes the background and applies a soft photo filter that lacks stylization, while the AI edit transforms the dogs into bold, clean cartoons. More results are available at: https://huggingface.co/spaces/PSRDataset/PSR-Battle-Results
  • Figure 3: Over 12 years of Reddit data, delete, adjust, and add are the top-3 most wanted actions (a). Specifically, humans, body parts, text, and pets are the most frequent WordNet subjects for such common actions (b). While delete and adjust are the top-2 most common actions in the low- and medium-creativity requests, add takes up the largest share (34.9%) in high-creativity, e.g., inserting some "interesting" background or objects into the scene (c). Most requests (55.5%) require straightforward edits with low creativity (d).
  • Figure 4: AIs can significantly alter both a person’s identity and the overall image quality through iterative image-editing requests. GPT-4o was repeatedly instructed to modify the shirt color, with each step's output serving as the input for the next iteration. Over iterations, facial identity, body shape, and the background shift away from the original person (more details in the appendix).
  • Figure 5: The o1 judge occasionally fails to notice details in edited images; here, it overlooks the position of the hand and the configuration of the fingers.
  • ...and 53 more figures
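
The iterative setup described in the Figure 4 caption, where each edit's output becomes the next edit's input, can be sketched as a simple loop. This is a minimal illustration and not the authors' code: `iterate_edits`, `drift_from_source`, `edit_fn`, and `distance_fn` are hypothetical names, images are stood in for by flat lists of numbers, and the toy editor and distance below are assumptions chosen only to make the drift visible.

```python
from typing import Callable, List, Sequence

# Stand-in for a real image type (an assumption for this sketch).
Image = List[int]


def iterate_edits(
    source: Image,
    instructions: Sequence[str],
    edit_fn: Callable[[Image, str], Image],
) -> List[Image]:
    """Apply edits sequentially, feeding each output into the next step."""
    outputs: List[Image] = []
    current = source
    for instruction in instructions:
        current = edit_fn(current, instruction)
        outputs.append(current)
    return outputs


def drift_from_source(
    source: Image,
    outputs: Sequence[Image],
    distance_fn: Callable[[Image, Image], float],
) -> List[float]:
    """Measure how far each intermediate output is from the original image."""
    return [distance_fn(source, out) for out in outputs]


def toy_editor(image: Image, instruction: str) -> Image:
    # Hypothetical editor: ignores the instruction and perturbs every value,
    # mimicking small unrequested changes that accumulate over iterations.
    return [value + 1 for value in image]


def mean_abs_diff(a: Image, b: Image) -> float:
    # Hypothetical distance: mean absolute difference between two images.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


source_image = [10, 20, 30]
steps = ["make the shirt red", "make the shirt blue", "make the shirt green"]
edited = iterate_edits(source_image, steps, toy_editor)
drift = drift_from_source(source_image, edited, mean_abs_diff)
print(drift)
```

In a real experiment, `edit_fn` would wrap a call to an image editor and `distance_fn` would be an identity or perceptual metric of the experimenter's choosing; the loop structure is the only part taken from the caption.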