Instruct-Imagen: Image Generation with Multi-modal Instruction

Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia

TL;DR

Instruct-Imagen tackles the challenge of heterogeneous image generation by introducing multi-modal instructions that unify text, style, subject, and other modalities into a single task representation. The model extends a pre-trained text-to-image diffusion backbone with a cross-attention mechanism conditioned on encoded multi-modal instructions and uses a two-stage training pipeline: retrieval-augmented pre-training to ground generations in relevant multimodal context, followed by instruction-tuning on diverse tasks. Across 11 datasets spanning text-to-image, control-to-image, subject-driven, style generation, and style transfer, Instruct-Imagen matches or surpasses state-of-the-art task-specific models and demonstrates strong zero-shot generalization to unseen, complex instructions. The approach improves controllability and alignment in image generation while maintaining practical inference speed, suggesting broad applicability in flexible, instruction-driven generation with external multimodal grounding.

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's capability to ground its generation on external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
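To make the idea of a multi-modal instruction more concrete, the following is a minimal sketch of how such a task representation might be stored: a natural-language instruction whose placeholders point at attached context items of different modalities. Everything here is an illustrative assumption rather than the authors' implementation: the class names `ContextItem` and `MultiModalInstruction`, the `[ref#N]` placeholder syntax, and the file paths are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextItem:
    """One piece of multi-modal context (hypothetical structure)."""
    modality: str    # e.g. "style", "subject", "edge"
    image_path: str  # placeholder path; a real system would hold image data


@dataclass
class MultiModalInstruction:
    """Natural-language instruction plus the contexts it refers to.

    The "[ref#N]" placeholder syntax is an assumption made for this sketch.
    """
    text: str
    contexts: Dict[str, ContextItem] = field(default_factory=dict)

    def referenced_keys(self) -> List[str]:
        # Collect placeholder keys in the order they appear in the text.
        return re.findall(r"\[(ref#\d+)\]", self.text)

    def validate(self) -> None:
        # Every placeholder in the text must have an attached context item.
        missing = [k for k in self.referenced_keys() if k not in self.contexts]
        if missing:
            raise ValueError(f"instruction refers to missing contexts: {missing}")


# Example: a subject-driven generation intent combined with a style intent.
instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    contexts={
        "ref#1": ContextItem(modality="subject", image_path="dog.png"),
        "ref#2": ContextItem(modality="style", image_path="watercolor.png"),
    },
)
instruction.validate()
```

Under these assumptions, the same container can express a single-condition task (one context item) or a compositional one (several items), which is the sense in which heterogeneous generation intents share a uniform format.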

Paper Structure (24 sections, 2 equations, 18 figures, 5 tables)

This paper contains 24 sections, 2 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Zero-shot generalization of Instruct-Imagen. Our model understands the multi-modal instruction (left) to generate an image (right) that reflects the complex and unseen image transformation.
  • Figure 2: Illustration of how multi-modal instruction uniformly expresses existing image generation tasks and extends to new tasks. Examples in this figure are retrieved from zhang2023adding, suti, sohn2023styledrop.
  • Figure 3: Overview of the two-stage training pipeline for the proposed Instruct-Imagen model.
  • Figure 4: Human study on prior methods, baselines, and Instruct-Imagen. Instruct-Imagen performs on par with or better than the baselines and prior methods, with the best generalization capability to novel tasks. Instruct-Imagen does not require fine-tuning for any task (particularly style/subject-related ones), and runs inference at an average speed of 18.2 seconds per example (on TPUv4).
  • Figure 5: Comparison on a subset of in-domain tasks. Examples generated from prior methods, baselines, and Instruct-Imagen. We visualize the multi-modal instruction for intuitive human understanding (models are evaluated with in-distribution inputs).
  • ...and 13 more figures