Personalized Image Generation with Large Multimodal Models

Yiyan Xu, Wenjie Wang, Yang Zhang, Biao Tang, Peng Yan, Fuli Feng, Xiangnan He

TL;DR

This work introduces Pigeon, a framework for personalized image generation with large multimodal models that infers user preferences from noisy interaction history and explicit multimodal instructions. It integrates three modules (mask generation, personalization, and image generation) into LaVIT-based generation and uses a two-stage preference alignment scheme (masked preference reconstruction followed by Direct Preference Optimization) to compensate for the lack of supervised triplet data. Quantitative and human evaluations on sticker and movie-poster tasks show that Pigeon outperforms DM-, LLM-, and LMM-based baselines in personalization while maintaining semantic alignment with reference images. The approach shows potential for personalized content across domains, with practical implications for e-commerce, advertising, and design; code and data are publicly released.

Abstract

Personalized content filtering, such as recommender systems, has become critical infrastructure for alleviating information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work on personalized image generation struggles to accurately capture users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome these challenges, we propose a Personalized Image Generation Framework named Pigeon, which builds on large multimodal models and adds three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
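
The TL;DR identifies the pairwise preference alignment stage as Direct Preference Optimization (DPO). As a rough illustration of what that objective computes, the sketch below evaluates the standard DPO loss for a single (preferred, dispreferred) pair from sequence log-probabilities. It is a minimal sketch, not Pigeon's implementation: the function name, the argument names, and the default `beta = 0.1` are assumptions, and this page does not describe how Pigeon constructs its preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (illustrative only).

    Each argument is the summed log-probability that the trainable policy
    or the frozen reference model assigns to the preferred ("chosen") or
    dispreferred ("rejected") output. `beta` is an assumed value.
    """
    # How much more (in log space) the policy favors each output
    # compared with the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)),
    # evaluated in a numerically stable way for either sign.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree on both outputs, the loss equals log 2; it falls below that value as the policy shifts probability toward the preferred output relative to the reference.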

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Personalized filtering selects the most relevant existing content while personalized generation creates new and customized ones, more precisely satisfying users' diverse content needs.
  • Figure 2: Two-stage preference alignments for Pigeon: given user-interacted images, the last image is treated as the target, with the preceding ones as user history.
  • Figure 3: Pigeon consists of three key modules: 1) the mask generation module creates token-level masks for history and reference images, 2) the personalized module encodes multimodal instructions and integrates them with the masked history to generate personalized tokens, and 3) the image generation module uses these tokens to produce personalized images.
  • Figure 4: In-depth analysis of the history mask and the two-stage preference alignment process.
  • Figure 5: Examples of generated images in sticker and movie poster scenarios, along with four user-interacted history images and one reference image.
  • ...and 2 more figures
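
The Figure 2 caption describes how training examples are formed from a user's interaction sequence: the last interacted image is the target and the preceding ones are the history. A minimal sketch of that split is shown below, assuming images are identified by string IDs in chronological order; the function name and the example IDs are hypothetical and do not come from the paper's code.

```python
from typing import List, Sequence, Tuple


def split_history_target(interacted: Sequence[str]) -> Tuple[List[str], str]:
    """Split a chronological list of user-interacted image IDs.

    Returns (history, target): the last image is the target and all
    preceding images form the user history, as in the Figure 2 caption.
    """
    if len(interacted) < 2:
        # At least one history image and one target image are needed.
        raise ValueError("need at least two interacted images")
    return list(interacted[:-1]), interacted[-1]


# Hypothetical image IDs, oldest first.
history, target = split_history_target(["img_01", "img_02", "img_03", "img_04"])
```

In this example, `history` holds the first three IDs and `target` is `"img_04"`.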