MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

William Berman, Alexander Peysakhovich

TL;DR

MUMU proposes a multimodal prompting framework that replaces the CLIP text encoder in a diffusion model with a vision–language model (Idefics2) to condition image generation on interleaved text–image prompts. By constructing a captioned dataset where image crops corresponding to caption words are inserted before those words, the model learns to harmonize conditioning from multiple images and perform style transfer without specialized conditioning modules. Trained end-to-end on open-weight components with LoRA on a single 8xH100 node, MUMU demonstrates notable capability to preserve conditioning details and generalize beyond the training task, while highlighting challenges in fine-grained detail consistency and evaluation of multimodal prompts. The work points to the promise of multimodal controllers for image generation and outlines scaling, data, and evaluation avenues to improve performance and reliability in practical applications.
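
To make the architectural idea concrete, here is a minimal sketch of how a vision-language model's hidden states could stand in for the CLIP text-encoder output as cross-attention conditioning for an SDXL-style UNet. Module and variable names (`vlm`, `unet`, the adapter, and the dimensions) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VLMToDiffusionAdapter(nn.Module):
    """Projects VLM hidden states into the cross-attention conditioning
    space of an SDXL-style UNet (dimensions are assumptions)."""
    def __init__(self, vlm_dim: int = 4096, cond_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, cond_dim)

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, vlm_dim) hidden states over the
        # interleaved text/image tokens of the multimodal prompt.
        return self.proj(vlm_hidden)  # (batch, seq_len, cond_dim)

# Hypothetical usage: the projected states replace the CLIP text embeddings.
# hidden = vlm(input_ids=tokens, pixel_values=images).last_hidden_state
# cond = adapter(hidden)
# noise_pred = unet(latents, timestep, encoder_hidden_states=cond).sample
```

The key design point is that the diffusion decoder consumes one sequence of conditioning vectors, so text and image tokens from the multimodal prompt can be mixed freely without separate conditioning modules.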

Abstract

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
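
The dataset bootstrapping described above can be illustrated with a short sketch: given a caption and object-detection boxes labeled with caption words, crop each box and insert the crop immediately before its word, yielding an interleaved text-image prompt. The data structures and matching logic here are simplified assumptions for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union
from PIL import Image

@dataclass
class Detection:
    word: str                       # caption word the box corresponds to
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def build_multimodal_prompt(image: Image.Image, caption: str,
                            detections: List[Detection]) -> List[Union[str, Image.Image]]:
    """Return an interleaved list of caption words and image crops,
    with each crop placed directly before its corresponding word."""
    crops = {d.word: image.crop(d.box) for d in detections}
    prompt: List[Union[str, Image.Image]] = []
    for word in caption.split():
        if word in crops:           # naive exact-word matching for the sketch
            prompt.append(crops[word])
        prompt.append(word)
    return prompt
```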

Paper Structure

This paper contains 13 sections, 12 figures, and 1 table.

Figures (12)

  • Figure 1: An example of a multimodal prompt, and a resulting generation from our MUMU-Idefics2-SDXL model. The model takes multimodal conditioning as input and outputs images.
  • Figure 2: MUMU-Idefics2-SDXL architecture. Red indicates modules which are trained, blue indicates frozen, black indicates embedding. Output is actual output from MUMU-Idefics2-SDXL to the given prompt.
  • Figure 3: A stylized example (not from our dataset) of the multimodal caption for a text-image pair. The object detection bounding boxes are cropped and inserted into the multimodal prompt before their corresponding words.
  • Figure 4: Multimodal prompts with direct input of conditioning into the diffusion model (MUMU) allow for much better detail preservation than ChatGPT+DALLE3, which uses images and text to construct a highly detailed text prompt for a text-to-image generator.
  • Figure 5: MUMU preserves more detail at higher tokens per image. At lower tokens per image, MUMU captures the gist of 'black robe'. At higher tokens per image, details such as the gold inlaid belt are better preserved.
  • ...and 7 more figures