User-Friendly Customized Generation with Multi-Modal Prompts

Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, Liqing Zhang

TL;DR

The paper tackles user-friendly, customized text-to-image generation by introducing multi-modal prompts, which pair text with a single image per visual concept to guide generation. It combines BLIP-based image captioning with LLM-driven semantic analysis to extract a precise description of the main object, then fine-tunes a diffusion model with a composite descriptor that preserves prior knowledge, enabling accurate customization of both objects and scenes. Results on a 15-object dataset show consistent improvements over Textual Inversion, DreamBooth, and Custom Diffusion in subject fidelity (DINO, CLIP-I) and image-text alignment (CLIP-T), and a human preference study supports these gains in fidelity and usability. The work identifies limitations that stem from current diffusion models and outlines future directions, including SDXL and richer multi-modal-prompt semantics, to extend the approach to multi-object and more complex prompts.
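
The data flow described above (caption the image, isolate the main object, build a composite descriptor for fine-tuning) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `caption_image` is a canned stand-in for a captioning model such as BLIP, `extract_main_object` is a toy heuristic standing in for the LLM's semantic analysis, and the `<V>` token and the "a photo of ..." template are hypothetical choices made for illustration.

```python
import re

# Hypothetical placeholder token for the customized concept (assumption).
CONCEPT_TOKEN = "<V>"

# Words that often introduce scene context in a caption (toy heuristic only).
_CONTEXT_SPLIT = re.compile(r"\s+(?:on|in|at|against|with|near|under|beside)\s+")
_LEADING_ARTICLE = re.compile(r"^(?:a|an|the)\s+", re.IGNORECASE)


def caption_image(image_path: str) -> str:
    """Stand-in for a captioning model such as BLIP: returns a canned caption."""
    canned = {
        "orange.png": "a cartoon orange on a road with a forest in the background",
        "car.png": "a red and black car",
    }
    return canned[image_path]


def extract_main_object(caption: str) -> str:
    """Stand-in for LLM semantic analysis: keep the text before the first context word."""
    head = _CONTEXT_SPLIT.split(caption.strip(), maxsplit=1)[0]
    return _LEADING_ARTICLE.sub("", head)


def build_training_prompt(description: str) -> str:
    """Combine the concept token with the extracted description (assumed template)."""
    return f"a photo of {CONCEPT_TOKEN} {description}"


if __name__ == "__main__":
    for path in ("car.png", "orange.png"):
        description = extract_main_object(caption_image(path))
        print(path, "->", build_training_prompt(description))
```

In a real pipeline the two stand-ins would be replaced by calls to an actual captioner and an actual LLM; the sketch only shows how their outputs would chain into a single fine-tuning prompt.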

Abstract

Text-to-image generation models have seen considerable advancement, catering to the increasing interest in personalized image creation. Current customization techniques often require users to provide multiple images (typically 3-5) for each customized object, along with the classification of these objects and descriptive textual prompts for scenes. This paper questions whether the process can be made more user-friendly and the customization more intricate. We propose a method in which users need only provide images along with text for each customization topic, and which requires only a single image per visual concept. We introduce the concept of a "multi-modal prompt", a novel integration of text and images tailored to each customization concept, which simplifies user interaction and facilitates precise customization of both objects and scenes. Our proposed paradigm for customized text-to-image generation surpasses existing finetune-based methods in user-friendliness and in the ability to customize complex objects with user-friendly inputs. Our code is available at https://github.com/zhongzero/Multi-Modal-Prompt.

Paper Structure

This paper contains 32 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: User-Friendly Customization Through Multi-Modal Prompts: Leveraging multi-modal prompts enables users to precisely tailor both objects and scenes of interest. When provided with such prompts, our paradigm efficiently generates images that not only feature the specified objects within the desired scenes but also excel in the detailed customization of complex objects, showcasing our method's superior performance and user-centric approach.
  • Figure 2: Two Examples of Multi-Modal Prompts: The first example features an image of a car showcasing a red and black color scheme, isolated against a void of other objects or background elements. The second example displays a cartoon orange positioned on a road, set against a forest backdrop, illustrating the versatility of multi-modal prompts in depicting complex and diverse scenarios.
  • Figure 3: The overview of our paradigm. Our innovative paradigm is divided into two crucial components: the extraction of main object descriptions and the customization of concepts while preserving detailed prior knowledge. Initially, the process involves extracting descriptions of the main objects within the multi-modal prompt images, which is executed in two phases: image captioning using BLIP, followed by semantic analysis with ChatGPT. Subsequently, the second component utilizes these extracted descriptions to maintain the detailed prior knowledge of the main objects, thereby enhancing the customization performance.
  • Figure 4: Qualitative comparisons. This figure showcases sample images generated by DreamBooth, Ours-DreamBooth, Custom Diffusion, Ours-Custom Diffusion, the extraction-directly method, and the finetuning-directly method across distinct multi-modal prompts. For an in-depth analysis, please see the sections on comparisons with current methods and the ablation study.
  • Figure 5: Sample guide for evaluation of Main Object Description Extraction. Participants are requested to classify each pair of an image and its extracted description into one of four categories: completely consistent, basically consistent, basically inconsistent, and completely inconsistent. Scores range from 3 to 0, corresponding to these categories in descending order of consistency.
  • ...and 3 more figures
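
The four-level rating scheme described in the Figure 5 caption maps directly onto a small scoring table. The sketch below shows one plausible way to turn participant ratings into a number; the category names and the 3-to-0 scores come from the caption, while aggregating by arithmetic mean and the name `mean_consistency_score` are assumptions made for illustration.

```python
from statistics import mean

# Category-to-score mapping taken from the Figure 5 caption.
CONSISTENCY_SCORES = {
    "completely consistent": 3,
    "basically consistent": 2,
    "basically inconsistent": 1,
    "completely inconsistent": 0,
}


def mean_consistency_score(ratings):
    """Average the numeric scores of a list of category labels (assumed aggregation)."""
    return mean(CONSISTENCY_SCORES[rating] for rating in ratings)


if __name__ == "__main__":
    # Made-up ratings for a single image-description pair, for illustration only.
    example_ratings = [
        "completely consistent",
        "basically consistent",
        "completely consistent",
        "basically inconsistent",
    ]
    print(mean_consistency_score(example_ratings))
```

An unknown label raises a `KeyError`, which keeps mistyped categories from being silently scored.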