X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

TL;DR

This work introduces X-Prompt, a purely auto-regressive vision-language foundation model that enables universal in-context image generation by compressing in-context exemplars into fixed-length tokens. It combines three components within a unified text-and-image prediction objective: in-context example compression, a task augmentation pipeline with reverse-task and difference-description signals, and retrieval-augmented image editing (RAIE). Results across text-to-image generation, dense prediction, and image editing show strong generalization to unseen tasks when in-context examples are given, with notable gains from dense captioning and RAIE. Limitations include information loss from VQ-VAE compression and restricted cross-task generalization; future work calls for broader multi-modal pretraining to approach a GPT-3 moment in unified multi-modal in-context learning.

Abstract

In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
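
One way to picture the compression of in-context examples into a fixed number of tokens is as an attention mask. The sketch below is illustrative only and is not taken from the paper's code: the sequence layout (exemplar tokens, then compression tokens, then target tokens), the function name `build_compression_mask`, and the token counts are all assumptions. It shows a causal mask in which target tokens cannot attend to the raw exemplar tokens, so exemplar information can reach them only through the compression tokens.

```python
import numpy as np


def build_compression_mask(n_context: int, n_compress: int, n_target: int) -> np.ndarray:
    """Return a boolean [T, T] mask; mask[i, j] is True when token i may attend to token j.

    Assumed layout (hypothetical): [exemplar tokens | compression tokens | target tokens].
    """
    total = n_context + n_compress + n_target
    # Standard causal mask: each token sees itself and all earlier tokens.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Block target tokens from the raw exemplar tokens. The compression
    # tokens still see the exemplars, so they act as the only bridge.
    target_start = n_context + n_compress
    mask[target_start:, :n_context] = False
    return mask


# Hypothetical sizes, chosen only to keep the example small.
mask = build_compression_mask(n_context=4, n_compress=2, n_target=3)
```

With this layout the cost of attending to the context from the target side depends on the number of compression tokens rather than on the exemplar length, which is the intuition behind supporting longer in-context sequences.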

Paper Structure

This paper contains 24 sections, 3 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: X-Prompt can perform multi-modal generation based on in-context examples in a purely auto-regressive foundation model.
  • Figure 2: Attention masking of X-Prompt for context feature compression and unified text and image next token prediction training.
  • Figure 3: Training data pair augmentation and list of training prototype tasks and subtasks. We introduce a reverse task and a difference-description task through next text token prediction to improve performance and generalizability.
  • Figure 4: Qualitative results on the MagicBrush [zhang2024magicbrush] test set, compared with MagicBrush results, with and without context examples.
  • Figure 5: Novel-task in-context testing compared to OmniGen [xiao2024omnigen]. X-Prompt can generalize to a novel task from a given example, while OmniGen [xiao2024omnigen] falls short in in-context learning (such as adapting to a new color spectrum or preserving details when adding an object to the image).
  • ...and 6 more figures
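
The training-pair augmentation described in the Figure 3 caption (a reverse task and a difference-description task) can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Sample` record, the `augment` helper, the task-label strings, and the file names are hypothetical, and the difference text is passed in as a placeholder rather than produced by any captioning model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    task: str                # free-form task label (hypothetical format)
    inputs: Tuple[str, ...]  # identifiers of the images given to the model
    output: str              # image identifier, or text the model should predict


def augment(task: str, src: str, dst: str, diff_text: str) -> List[Sample]:
    """Expand one (src -> dst) pair into three training samples."""
    return [
        # Original task: predict the target image from the source image.
        Sample(task, (src,), dst),
        # Reverse task: swap the roles of source and target.
        Sample(f"reverse of {task}", (dst,), src),
        # Difference description: predict text describing how the pair differs.
        Sample(f"describe the difference for {task}", (src, dst), diff_text),
    ]


# Hypothetical example pair; none of these names come from the paper.
samples = augment(
    task="low-light enhancement",
    src="dark.png",
    dst="bright.png",
    diff_text="The second image is a brightened version of the first.",
)
```

Each original pair yields one image-to-image sample in each direction plus one sample whose output is text, matching the caption's point that the added tasks are trained through next text token prediction alongside image prediction.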