PMG : Personalized Multimodal Generation with Large Language Models

Xiaoteng Shen; Rui Zhang; Xiaoyan Zhao; Jieming Zhu; Xi Xiao

TL;DR

PMG addresses personalized multimodal generation by using an LLM to translate user behavior into explicit keywords and soft preference embeddings, which, together with target-item keywords, condition a generator such as a diffusion model or a multimodal LLM. The approach introduces a bias-corrected LLM with multimodal tokens and P-Tuning V2, and optimizes a weighted objective $z = \alpha \cdot \log d_p + (1-\alpha) \cdot \log d_t$, where $\alpha$ balances the personalization score $d_p$ against the target fidelity score $d_t$. Empirical results across fashion, movie posters, and emoticons show an improvement in personalization of up to 8% in terms of LPIPS while maintaining generation accuracy. Ablations confirm the benefit of combining explicit keywords with soft embeddings, as well as the value of prompt tuning and multimodal tokens. The work also demonstrates downstream gains for recommendation by using generated visuals as auxiliary features.
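The weighted objective above can be illustrated with a minimal sketch. It assumes that $d_p$ and $d_t$ are already available as positive scalar scores for each candidate output; the function names, the candidate dictionaries, and the "pick the highest $z$" selection step are hypothetical and are not taken from the paper.

```python
import math


def weighted_objective(d_p: float, d_t: float, alpha: float) -> float:
    """Compute z = alpha * log(d_p) + (1 - alpha) * log(d_t).

    d_p: personalization score (assumed positive).
    d_t: target fidelity score (assumed positive).
    alpha: trade-off weight in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if d_p <= 0.0 or d_t <= 0.0:
        raise ValueError("scores must be positive for the logarithm to be defined")
    return alpha * math.log(d_p) + (1.0 - alpha) * math.log(d_t)


def pick_best(candidates: list[dict], alpha: float) -> dict:
    """Return the candidate with the highest z (hypothetical selection step)."""
    return max(
        candidates,
        key=lambda c: weighted_objective(c["d_p"], c["d_t"], alpha),
    )


# Illustrative, made-up scores for two candidate generations.
candidates = [
    {"name": "strongly_personalized", "d_p": 0.9, "d_t": 0.2},
    {"name": "balanced", "d_p": 0.5, "d_t": 0.5},
]
best_balanced = pick_best(candidates, alpha=0.5)
best_personal = pick_best(candidates, alpha=1.0)
```

Because the two scores enter through logarithms, a candidate that scores very low on either one is penalized heavily, so with `alpha=0.5` the balanced candidate is preferred, while `alpha=1.0` considers the personalization score alone.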

Abstract

The emergence of large language models (LLMs) has revolutionized the capabilities of text comprehension and generation. Multimodal generation attracts great attention from both industry and academia, but there is little work on personalized generation, which has important applications such as recommender systems. This paper proposes the first method for personalized multimodal generation using LLMs, showcases its applications, and validates its performance via an extensive experimental study on two datasets. The proposed method, Personalized Multimodal Generation (PMG for short), first converts user behaviors (e.g., clicks in recommender systems or conversations with a virtual assistant) into natural language to facilitate LLM understanding and extract user preference descriptions. Such user preferences are then fed into a generator, such as a multimodal LLM or diffusion model, to produce personalized content. To capture user preferences comprehensively and accurately, we propose to let the LLM output a combination of explicit keywords and implicit embeddings to represent user preferences. The combination of keywords and embeddings is then used as a prompt to condition the generator. We optimize a weighted sum of the accuracy and preference scores so that the generated content strikes a good balance between them. Compared to a baseline method without personalization, PMG improves personalization by up to 8% in terms of LPIPS while retaining the accuracy of generation.
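The first step described in the abstract, converting user behaviors into natural language for an LLM, could look roughly like the sketch below. The `Behavior` record, its fields, and the prompt wording are all hypothetical and are not the paper's actual template; the sketch only shows the general idea of serializing behaviors into text and asking for preference keywords.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    """One user behavior (hypothetical schema, not from the paper)."""
    kind: str  # e.g. "click" or "conversation"
    text: str  # short natural-language description of the behavior


def behaviors_to_prompt(behaviors: list[Behavior], target_item: str) -> str:
    """Serialize behaviors into a plain-text prompt asking for preference keywords."""
    # Number each behavior so the LLM sees an ordered history.
    history = [f"{i}. [{b.kind}] {b.text}" for i, b in enumerate(behaviors, start=1)]
    return "\n".join(
        [
            "The user has the following history:",
            *history,
            f"Target item: {target_item}",
            "List a few keywords that describe this user's preferences.",
        ]
    )


prompt = behaviors_to_prompt(
    [
        Behavior("click", "a cartoon cat sticker"),
        Behavior("conversation", "asked for cute animal emoticons"),
    ],
    target_item="a smiling emoticon",
)
```

The resulting string would be sent to an LLM; the returned keywords correspond to the explicit half of the preference representation, while the soft preference embeddings require the separate training step described in the paper.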

Paper Structure (28 sections, 9 equations, 14 figures, 4 tables)

Introduction
Related work
Multimodal Generation
LLM for Recommendation
Method
Overview
Generate Explicit Keywords
Preprocess of user behaviors.
Construction of prompt.
Generate Soft Preference Embeddings
Bias Correction LLM
Training with multimodal supervision.
Balancing the accuracy score and the preference score
Experiment
Experimental Setup
...and 13 more sections

Figures (14)

Figure 1: The personalized generation based on user behaviors produces emoticons of a cute cat that are more appealing to cat lovers compared to the normal generation.
Figure 2: Overview of our method. By utilizing user behaviors and a target item as input, we generate personalized multimodal content for the item, taking a movie poster as an example in the figure.
Figure 3: Model designed to train soft preference embeddings.
Figure 4: Generated image comparison of our method PMG in the costume scene. Four typical users with different styles of historical items are picked as input to generate images of shoes and a shirt.
Figure 5: Generated image comparison of our method PMG in the movie poster scene. Three users with different movie interests are picked as input to generate posters of movie True Crime and Titanic.
...and 9 more figures
