Responsible Visual Editing
Minheng Ni, Yeli Shen, Lei Zhang, Wangmeng Zuo
TL;DR
The paper defines responsible visual editing as minimally altering an image to remove or mitigate risky concepts across safety, fairness, and privacy. It introduces CoEditor, a two-stage cognitive framework: a perceptual cognitive process (PCP) locates what needs to be modified, and a behavioral cognitive process (BCP) plans how to modify it. The framework leverages a large multimodal model (GPT-4V) and Semantic-SAM for guidance and applies Stable Diffusion Inpainting for the final edits. To support evaluation, it presents AltBear, a safe dataset that uses teddy bears to mirror real harmful content, along with machine and human metrics for success and visual similarity, and markers to prevent misuse. Experimental results show that CoEditor substantially surpasses the baselines (InstructPix2Pix and InstructDiffusion) in both quantitative and qualitative assessments, including general editing, with strong consistency between AltBear and real data. The work also emphasizes reproducibility, anti-misuse safeguards, and ethical considerations, and acknowledges higher computational cost as a limitation, with efficiency left as a future direction.
Abstract
With recent advancements in visual synthesis, there is a growing risk of encountering images with detrimental effects, such as hate, discrimination, or privacy violations. Research on transforming harmful images into responsible ones remains unexplored. In this paper, we formulate a new task, responsible visual editing, which entails modifying specific concepts within an image to render it more responsible while minimizing changes. However, the concept that needs to be edited is often abstract, making it challenging to locate what needs to be modified and plan how to modify it. To tackle these challenges, we propose a Cognitive Editor (CoEditor) that harnesses the large multimodal model through a two-stage cognitive process: (1) a perceptual cognitive process to focus on what needs to be modified and (2) a behavioral cognitive process to strategize how to modify. To mitigate the negative implications of harmful images on research, we create a transparent and public dataset, AltBear, which expresses harmful information using teddy bears instead of humans. Experiments demonstrate that CoEditor can effectively comprehend abstract concepts within complex scenes and significantly surpass the performance of baseline models for responsible visual editing. We find that the AltBear dataset corresponds well to the harmful content found in real images, offering a consistent experimental evaluation and thereby providing a safer benchmark for future research. Moreover, CoEditor also performs well in general editing. We release our code and dataset at https://github.com/kodenii/Responsible-Visual-Editing.
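
The two-stage process described above can be illustrated with a minimal Python sketch. This is not the released CoEditor code: the class name `TwoStageEditor`, the three callables, and the toy stand-ins are all hypothetical, and images are plain nested lists so the example runs without any model. The final step, which keeps original pixels outside the mask, is an assumption made here to illustrate "minimizing changes", not a detail taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image; a real system would use arrays
Mask = List[List[bool]]   # True marks pixels that may be modified


@dataclass
class TwoStageEditor:
    # All three callables are hypothetical placeholders, not the paper's API.
    perceive: Callable[[Image, str], Mask]        # stage 1: what to modify
    plan: Callable[[Image, Mask, str], str]       # stage 2: how to modify
    inpaint: Callable[[Image, Mask, str], Image]  # final masked edit

    def edit(self, image: Image, risky_concept: str) -> Image:
        mask = self.perceive(image, risky_concept)
        prompt = self.plan(image, mask, risky_concept)
        edited = self.inpaint(image, mask, prompt)
        # Assumption: keep original pixels wherever the mask is False.
        return [
            [e if m else o for o, e, m in zip(orow, erow, mrow)]
            for orow, erow, mrow in zip(image, edited, mask)
        ]


# Toy stand-ins so the sketch runs end to end without any model.
def toy_perceive(image: Image, concept: str) -> Mask:
    # Pretend that bright pixels depict the risky concept.
    return [[px > 200 for px in row] for row in image]


def toy_plan(image: Image, mask: Mask, concept: str) -> str:
    return f"replace {concept} with a neutral object"


def toy_inpaint(image: Image, mask: Mask, prompt: str) -> Image:
    # Pretend the inpainting model repaints everything as black.
    return [[0 for _ in row] for row in image]


editor = TwoStageEditor(toy_perceive, toy_plan, toy_inpaint)
result = editor.edit([[10, 255], [230, 40]], "weapon")
```

In a real pipeline, the perception stage would presumably be backed by a multimodal model plus a segmentation model, and the inpainting stage by a diffusion inpainting model; passing them in as callables keeps the control flow visible without committing to any specific library interface.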
