Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao

TL;DR

This work tackles few-shot image manipulation by enabling in-context learning in a multimodal autoregressive model (InstaManip). It introduces a two-stage learning-and-applying paradigm realized via group self-attention and a relation regularization to disentangle transformation from content, formalized as $P(\mathcal{Y}|\mathcal{X},\mathcal{T},\mathcal{X}',\mathcal{Y}') = P(\mathcal{Z}|\mathcal{T},\mathcal{X}',\mathcal{Y}') \cdot P(\mathcal{Y}|\mathcal{X},\mathcal{Z})$. Empirically, InstaManip surpasses prior methods by substantial margins (e.g., $\geq 19\%$ in human evaluation) and scales with more exemplar images, demonstrating improved alignment with textual and visual guidance and robust generalization to unseen instructions. The findings highlight design principles for task-specific in-context learning in autoregressive models and suggest practical avenues for deploying few-shot image manipulation with strong reasoning capabilities.
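One way to read the factorization above is as a chain-rule decomposition under two conditional-independence assumptions. This is a sketch, not the authors' stated derivation. It assumes $\mathcal{T}$ is the textual instruction, $(\mathcal{X}', \mathcal{Y}')$ the exemplar source and target images, $\mathcal{X}$ the query image, $\mathcal{Y}$ the output image, and $\mathcal{Z}$ the learned manipulation representation. The two assumptions are (i) $\mathcal{Z}$ does not depend on the query $\mathcal{X}$ once the instruction and exemplars are given, and (ii) $\mathcal{Y}$ depends on the instruction and exemplars only through $\mathcal{Z}$:

$$
\begin{aligned}
P(\mathcal{Y}, \mathcal{Z} \mid \mathcal{X}, \mathcal{T}, \mathcal{X}', \mathcal{Y}')
&= P(\mathcal{Z} \mid \mathcal{X}, \mathcal{T}, \mathcal{X}', \mathcal{Y}') \cdot P(\mathcal{Y} \mid \mathcal{Z}, \mathcal{X}, \mathcal{T}, \mathcal{X}', \mathcal{Y}') \\
&= P(\mathcal{Z} \mid \mathcal{T}, \mathcal{X}', \mathcal{Y}') \cdot P(\mathcal{Y} \mid \mathcal{X}, \mathcal{Z})
\end{aligned}
$$

If $\mathcal{Z}$ is further treated as a single point estimate determined by $(\mathcal{T}, \mathcal{X}', \mathcal{Y}')$, the left-hand side reduces to $P(\mathcal{Y}|\mathcal{X},\mathcal{T},\mathcal{X}',\mathcal{Y}')$, matching the form quoted above. The first factor corresponds to the learning stage and the second to the applying stage.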

Abstract

Text-guided image manipulation has advanced notably in recent years. To mitigate linguistic ambiguity, few-shot learning with visual examples has been applied to instructions that are underrepresented in the training set or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models struggle with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism that breaks the in-context learning process into two separate stages, learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant content in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find that our model's performance improves further as the number or diversity of exemplar images increases.

Paper Structure

This paper contains 29 sections, 6 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: When learning a new image manipulation operation that is unseen in the training set (as shown above), textual instructions directly point out the subject and provide high-level semantic guidance, while exemplar images mitigate linguistic ambiguity and show more local details that are difficult to describe in language. Our proposed multi-modal autoregressive model, InstaManip, takes advantage of both textual and visual guidance to learn a representation of the desired transformation, and applies it to a new query image.
  • Figure 2: Comparison of InstructPix2Pix (Brooks et al., 2023) and our model. We exclude "Lamborghini" from the training set for both models.
  • Figure 3: Comparison of the performance of plain self-attention (with causal mask) and the proposed group self-attention.
  • Figure 4: Overview of the proposed InstaManip architecture (left) and group self-attention mechanism (right, represented by query-key matrix). We first tokenize all input texts and images, and insert them into a prompt template with learnable manipulation and generation tokens. We input the prompt into the proposed model, which is composed of $N$ blocks. The group self-attention layer in each block learns an explicit manipulation representation $\mathcal{Z}$ and applies it to the new query image. We forward the final generation tokens and the query image to the image decoder for final image synthesis. In the left part, we only show the self-attention correlations that connect with manipulation tokens or generation tokens for brevity. We also omit encoders, input projection layers and skip connections for simplicity.
  • Figure 5: Qualitative comparison with InstructPix2Pix and previous few-shot image manipulation methods. All instructions containing selected keywords (highlighted in red) are excluded from the training set, so that the models are not optimized on these manipulation operations. Our model follows the textual instruction better, and performs the transformation more aligned with exemplar image pairs.
  • ...and 11 more figures
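
The group self-attention pattern described in the Figure 4 caption can be illustrated as a boolean query-key mask. The sketch below is not the authors' implementation and rests on several assumptions: the prompt order (instruction, exemplar source, exemplar target, manipulation tokens, query image, generation tokens), the segment names `T`, `X_ex`, `Y_ex`, `Z`, `X`, `Y`, the token counts, and the rule that a token may attend only to earlier tokens sharing one of its groups. The manipulation tokens `Z` are placed in both groups so that they are the only path from the exemplars to the query.

```python
import numpy as np


def group_attention_mask(segment_lengths, groups, causal=True):
    """Build a boolean (query x key) mask for grouped self-attention.

    segment_lengths: dict mapping a segment name to its token count,
                     listed in prompt order (hypothetical names).
    groups:          list of sets of segment names; tokens may attend to
                     each other only if they share at least one group.
    Returns (mask, labels), where mask[i, j] is True when query token i
    may attend to key token j, and labels[i] is the segment of token i.
    """
    # One segment label per token position, in prompt order.
    labels = [name for name, n in segment_lengths.items() for _ in range(n)]
    n_tokens = len(labels)

    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for group in groups:
        member = np.array([label in group for label in labels])
        # Allow attention between every pair of tokens inside this group.
        mask |= member[:, None] & member[None, :]

    if causal:
        # Keep only keys at or before the query position.
        mask &= np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
    return mask, labels


# Illustrative token counts; real prompts would use many more tokens per image.
segments = {"T": 2, "X_ex": 2, "Y_ex": 2, "Z": 1, "X": 2, "Y": 1}
learning_group = {"T", "X_ex", "Y_ex", "Z"}   # stage 1: learn Z from guidance
applying_group = {"Z", "X", "Y"}              # stage 2: apply Z to the query
mask, labels = group_attention_mask(segments, [learning_group, applying_group])

print(labels)
print(mask.astype(int))
```

Under these assumptions, the query-image and generation tokens see the instruction and exemplars only through `Z`, which mirrors the learning-then-applying split. In an attention layer, a mask like this would typically be applied by setting the disallowed query-key scores to negative infinity before the softmax.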