Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization

Connor Dunlop, Matthew Zheng, Kavana Venkatesh, Pinar Yanardag

TL;DR

This paper tackles personalized image editing in text-to-image diffusion models by introducing Collaborative Direct Preference Optimization (C-DPO), which conditions edits on per-user embeddings learned from a graph of like-minded preferences. A lightweight GraphSAGE-based GNN computes contextual user representations that are softly integrated into a DPO objective to balance individual alignment with collaborative signals from neighbors. The training pipeline uses a two-stage approach: supervised fine-tuning to create a reference policy and subsequent C-DPO fine-tuning with user conditioning and graph-based regularization; personalization is realized through soft prompt tokens without altering the base diffusion model. Experiments on a large synthetic dataset demonstrate improved user-specific alignment and image fidelity over baselines, with user studies confirming perceptual personalization gains. The work advances practical, scalable personalized editing for diffusion models, while acknowledging potential biases and limitations in new-user scenarios and synthetic data.
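
The neighbor-aware user representation described above can be illustrated with a minimal sketch of one GraphSAGE-style mean-aggregation layer. This is not the authors' implementation: the function names, the toy users, the two-dimensional features, and the identity weight matrices are all assumptions made for illustration, and a real model would learn the weights and typically stack several layers.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sage_mean_layer(
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Dict[str, Vector]:
    """One mean-aggregator layer: ReLU(W_self h_u + W_neigh mean(h_v for v in N(u)))."""
    out: Dict[str, Vector] = {}
    for user, h in features.items():
        nbrs = neighbors.get(user, [])
        # Users without neighbors fall back to a zero neighbor message.
        agg = mean_vector([features[v] for v in nbrs]) if nbrs else [0.0] * len(h)
        z = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        out[user] = [max(0.0, value) for value in z]  # ReLU
    return out


# Hypothetical toy graph: three users with 2-d preference features.
features = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [1.0, 1.0]}
neighbors = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": []}
identity = [[1.0, 0.0], [0.0, 1.0]]

embeddings = sage_mean_layer(features, neighbors, identity, identity)
print(embeddings["alice"])  # [1.5, 1.0]
```

With identity weights, each user's output is simply their own feature vector plus the average of their neighbors' vectors, which makes the information sharing between like-minded users easy to see.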

Abstract

Text-to-image (T2I) diffusion models have made remarkable strides in generating and editing high-fidelity images from text. Yet, these models remain fundamentally generic, failing to adapt to the nuanced aesthetic preferences of individual users. In this work, we present the first framework for personalized image editing in diffusion models, introducing Collaborative Direct Preference Optimization (C-DPO), a novel method that aligns image edits with user-specific preferences while leveraging collaborative signals from like-minded individuals. Our approach encodes each user as a node in a dynamic preference graph and learns embeddings via a lightweight graph neural network, enabling information sharing across users with overlapping visual tastes. We enhance a diffusion model's editing capabilities by integrating these personalized embeddings into a novel DPO objective, which jointly optimizes for individual alignment and neighborhood coherence. Comprehensive experiments, including user studies and quantitative benchmarks, demonstrate that our method consistently outperforms baselines in generating edits that are aligned with user preferences.
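
To make "individual alignment and neighborhood coherence" concrete, the sketch below combines the standard DPO loss on a single preference pair with a neighbor-smoothness penalty. The DPO term follows the usual formulation; the penalty, the weight `lam`, and all function names are hypothetical stand-ins, since the exact form of the C-DPO objective is not given in this summary.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_w: float, policy_l: float,
             ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given log-probabilities of the
    preferred (w) and rejected (l) outputs under policy and reference."""
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)


def neighbor_penalty(user_emb: List[float],
                     neighbor_embs: List[List[float]]) -> float:
    """Hypothetical coherence term: squared distance between a user's
    embedding and the mean embedding of their graph neighbors."""
    if not neighbor_embs:
        return 0.0
    mean = [sum(col) / len(neighbor_embs) for col in zip(*neighbor_embs)]
    return sum((u - m) ** 2 for u, m in zip(user_emb, mean))


def collaborative_loss(policy_w: float, policy_l: float,
                       ref_w: float, ref_l: float,
                       user_emb: List[float],
                       neighbor_embs: List[List[float]],
                       beta: float = 0.1, lam: float = 0.5) -> float:
    """Assumed combination: individual DPO term plus weighted neighbor term."""
    individual = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta)
    return individual + lam * neighbor_penalty(user_emb, neighbor_embs)


# Toy values: the policy already prefers the chosen edit more than the reference does.
loss = collaborative_loss(-1.0, -3.0, -2.0, -2.0,
                          user_emb=[0.0, 0.0], neighbor_embs=[[1.0, 1.0]])
print(loss)
```

When policy and reference agree, the DPO term equals log 2; it shrinks as the policy assigns relatively more probability to the preferred edit. The weight `lam` trades that individual signal against agreement with neighbors.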

Paper Structure

This paper contains 34 sections, 7 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Our framework performs personalized image editing aligned with a user's preferences via a novel DPO objective that learns those preferences while leveraging collaborative signals from like-minded individuals.
  • Figure 2: 1) We first fine-tune a language model so it can generate precise editing instructions. 2) We then introduce a graph-aware DPO objective that leverages collaborative user data to learn individual editing preferences. 3) After training, the system takes an input image and a user profile, produces tailored editing instructions, and outputs the corresponding personalized edit.
  • Figure 3: (a) Qualitative Results for Individual Users on Different Objects. Our framework is able to incorporate personalized elements into the image editing process, such as adding neon or futuristic elements for the Futuristic Techie profile. (b) Qualitative Results for User-Provided Personalized Edits. Our framework allows users to provide additional guidance while performing personalized edits.
  • Figure 4: Qualitative Results for a Single User on Diverse Objects. For a user who loves unicorns, rainbows, and vibrant, playful palettes, our system infuses that whimsical aesthetic into a wide range of objects - from cars and guitars to watchtowers.
  • Figure 5: Qualitative Results for Different Users on the Same Objects. Our framework tailors image edits to each user’s preferences. In the first row, for example, a user who loves unicorns, rainbows, and playful color schemes sees their inputs transformed to match those personalized aesthetic preferences.
  • ...and 7 more figures