LLMGA: Multimodal Large Language Model based Generation Assistant

Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia

TL;DR

LLMGA introduces a language-prompt–driven approach to multimodal image generation and editing, replacing fixed visual embeddings with detailed prompts generated by a multimodal LLM. The method employs a two-stage training regime: first, the MLLM learns to produce rich generation prompts across four tasks; second, Stable Diffusion is fine-tuned to follow these prompts, with CLIP-token handling for long prompts. A diffusion-based restoration network (DiffRIR) reduces texture, brightness, and contrast disparities between newly generated and preserved regions during editing. Experimental results across T2I, inpainting/outpainting, and instruction-based editing demonstrate superior quality and interactive capability, with additional supplementary materials detailing data, control schemes, and qualitative demonstrations.
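The summary above mentions CLIP-token handling for long prompts but does not spell out the mechanism. As a rough illustration only, the sketch below shows one common workaround: splitting a long token sequence into windows that each fit the 77-token context of the CLIP text encoder used by Stable Diffusion, so that each window can be encoded separately. Whether LLMGA uses this exact scheme is an assumption; the function name `split_into_clip_windows` is hypothetical, and the input is a plain list of token ids rather than the output of a real tokenizer.

```python
from typing import List

CLIP_MAX_TOKENS = 77  # context length of the CLIP text encoder
BOS_ID = 49406        # CLIP start-of-text token id
EOS_ID = 49407        # CLIP end-of-text token id (reused here as padding)


def split_into_clip_windows(
    token_ids: List[int], max_len: int = CLIP_MAX_TOKENS
) -> List[List[int]]:
    """Split prompt token ids into fixed-length windows (hypothetical helper).

    Each window is laid out as [BOS] + chunk + [EOS] + padding, so every
    window has exactly `max_len` entries and can be encoded on its own.
    """
    body = max_len - 2  # room left after the BOS and EOS tokens
    # An empty prompt still yields one (empty) chunk.
    chunks = [token_ids[i:i + body] for i in range(0, len(token_ids), body)] or [[]]
    windows = []
    for chunk in chunks:
        padding = [EOS_ID] * (body - len(chunk))
        windows.append([BOS_ID] + chunk + [EOS_ID] + padding)
    return windows


if __name__ == "__main__":
    # A fake 100-token prompt needs two windows of 77 tokens each.
    windows = split_into_clip_windows(list(range(100)))
    print(len(windows), [len(w) for w in windows])
```

In a real pipeline the per-window embeddings would then be concatenated along the sequence axis before being passed to the diffusion model's cross-attention; that step is omitted here because it depends on the specific encoder implementation.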

Abstract

In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.
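To make the "generated versus preserved regions" wording concrete, inpainting and outpainting outputs are often described as a mask-based composite of the two. The formula below is a generic illustration, not an equation taken from the paper, and the symbols are assumptions introduced here: $m$ is a binary mask equal to 1 where content is to be generated, $x_{\text{gen}}$ is the model output, and $x_{\text{src}}$ is the input image.

```latex
\hat{x} = m \odot x_{\text{gen}} + (1 - m) \odot x_{\text{src}}
```

Under this view, any mismatch in texture, brightness, or contrast between $x_{\text{gen}}$ and $x_{\text{src}}$ shows up along the mask boundary, which is the kind of disparity the reference-based restoration network is meant to reduce.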

Paper Structure (25 sections, 14 equations, 20 figures, 7 tables)

Figures (20)

  • Figure 1: Some examples of LLMGA assisting in image generation and editing. (1) T2I generation. LLMGA can refine the user's generation prompt to produce more vivid and vibrant images. (2) Similar image generation. LLMGA can understand the components and layout of the input images and generate a similar image. (3) Inpainting & outpainting. LLMGA can provide detailed generation prompts based on user preferences and input images. (4) Instruction-based editing. LLMGA can understand user instructions and realize accurate editing. (5) Interactive image generation and editing, which exemplifies the comprehensive capabilities of LLMGA. Users can design satisfactory images by interacting with LLMGA and leveraging its vast knowledge.
  • Figure 1: An example of generating detailed descriptions for description data.
  • Figure 2: The overview of LLMGA. (a) In the first training stage, we train the MLLM to produce generation prompts based on provided instructions. Moreover, we construct a training dataset including four categories: prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. (b) In the second training stage, we optimize SD to adapt to the detailed generation prompts from MLLM. (c) In the inference stage, LLMGA can respond to user queries and assist in various tasks, such as image generation, inpainting & outpainting, and editing.
  • Figure 2: An example of generating detailed descriptions for instruction-based editing data.
  • Figure 3: T2I visual comparison. LLMGA can produce accurate and high-quality results.
  • ...and 15 more figures