Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Zheng Gu, Shiyuan Yang, Jing Liao, Jing Huo, Yang Gao

TL;DR

Analogist tackles visual in-context learning by combining structural visual guidance with semantic textual prompts in a diffusion-inpainting framework. It introduces Self-Attention Cloning (SAC), which clones the structural relation between A and B onto A' and B', and Cross-Attention Masking (CAM), which confines the text guidance to the target region B'. The text prompts are generated by GPT-4V from the grid-formatted input images. The method requires no fine-tuning, and the paper reports that it outperforms existing approaches on low-level, manipulation, and vision tasks, as measured by CLIP-direction scores, FID, and user studies, while keeping inference time reasonable. This dual-prompting approach broadens the practical applicability of visual ICL and highlights the potential of integrating large multimodal models with diffusion-based inpainting for versatile image transformation tasks.

Abstract

Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
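
The inpainting formulation can be made concrete with a small sketch. As the Figure 1 caption describes, the three images are placed in a $2\times 2$ grid and the fourth cell is left for the inpainting model to fill. The function name `make_grid_and_mask`, the cell layout (A top-left, A' top-right, B bottom-left, B' bottom-right), and the mask convention (255 marks the region to inpaint) are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np


def make_grid_and_mask(img_a, img_a_prime, img_b):
    """Arrange A, A', B in a 2x2 grid and mask the empty B' cell.

    Images are (H, W, C) arrays of identical shape.
    Assumed layout: A top-left, A' top-right, B bottom-left, B' bottom-right.
    Assumed mask convention: 255 = region to inpaint, 0 = keep.
    """
    if not (img_a.shape == img_a_prime.shape == img_b.shape):
        raise ValueError("A, A', and B must share the same shape")
    h, w, c = img_a.shape

    grid = np.zeros((2 * h, 2 * w, c), dtype=img_a.dtype)
    grid[:h, :w] = img_a
    grid[:h, w:] = img_a_prime
    grid[h:, :w] = img_b  # the bottom-right cell (B') stays empty

    mask = np.zeros((2 * h, 2 * w), dtype=np.uint8)
    mask[h:, w:] = 255
    return grid, mask


# Tiny usage example with constant-valued 4x4 RGB "images".
a = np.full((4, 4, 3), 10, dtype=np.uint8)
a_prime = np.full((4, 4, 3), 20, dtype=np.uint8)
b = np.full((4, 4, 3), 30, dtype=np.uint8)
grid, mask = make_grid_and_mask(a, a_prime, b)
```

The resulting `grid` and `mask` are the kind of image/mask pair that an inpainting model takes as input, together with the text prompt.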

Paper Structure (41 sections, 5 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Overview of the proposed Analogist. A visual demonstration is defined by an example pair $A$ (woman holding a cat) and $A'$ (the same woman holding a tiger). Given a new image $B$ (another cat), we format these three images into a $2\times 2$ grid and tackle the problem by filling in the missing image with a pretrained Stable Diffusion inpainting model. We employ GPT-4V to provide a suitable text description (e.g., "close-up of a tiger's face") to further guide the inpainting process. During model inference, Self-Attention Cloning (SAC) and Cross-Attention Masking (CAM) are introduced to encourage the model to concentrate on the visual and textual prompts, thus enhancing its in-context learning capacity. Source image: InstructPix2Pix brooks2023instructpix2pix.
  • Figure 2: Visualization of the attention relationships. Given an anchor point on image $A$ (shown in red, green, and blue), we calculate the attention values between this point and all regions of image $B$. Source image: InstructPix2Pix brooks2023instructpix2pix.
  • Figure 3: Detailed illustration of self-attention cloning (SAC). The sub self-attention map $\mathcal{M}_s(A',B')$ is set to the value of $\mathcal{M}_s(A,B)$, which clones the relation between $A$ and $B$ to that between $A'$ and $B'$.
  • Figure 4: Detailed illustration of cross-attention masking (CAM). The sub cross-attention maps between the text embedding and regions $A$, $A'$, and $B$ are set to zero, focusing the semantic guidance on region $B'$.
  • Figure 5: Comparison with other baseline methods. Each row shows one task, given the input image pair $A$, $A'$ and the query image $B$. Since MAEVQGAN bar2022visual does not take text as input, and DIA vsubrtova2023diffusion and VISII nguyen2023visual estimate the text prompts by extra optimization, the text prompts generated by GPT-4V are used only by PromptDiffusion wang2023incontext and Analogist. Source images: ImageNet deng2009imagenet, LOL Chen2018Retinex, InstructPix2Pix brooks2023instructpix2pix, UBC-Fashion zablotskaia2019dwnet, ScanNet dai2017scannet, DAVIS perazzi2016benchmark.
  • ...and 11 more figures
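
The two attention operations described in the captions of Figures 3 and 4 can be illustrated with a minimal NumPy sketch on toy attention maps. It is a sketch under stated assumptions, not the authors' implementation: the helper names, the quadrant layout (A top-left, A' top-right, B bottom-left, B' bottom-right), and the convention that $\mathcal{M}_s(X,Y)$ takes its rows from the tokens of $X$ and its columns from the tokens of $Y$ are all hypothetical. The sketch also leaves the edited maps as they are, without rescaling or renormalizing them.

```python
import numpy as np

# Assumed quadrant layout of the 2x2 grid (not stated in the captions).
QUADRANTS = {"A": (0, 0), "A'": (0, 1), "B": (1, 0), "B'": (1, 1)}


def quadrant_indices(name, h, w):
    """Flat row-major token indices of one quadrant in a (2h x 2w) token grid."""
    qr, qc = QUADRANTS[name]
    rows = np.arange(h) + qr * h
    cols = np.arange(w) + qc * w
    return (rows[:, None] * (2 * w) + cols[None, :]).ravel()


def self_attention_cloning(attn, h, w):
    """Set the sub-map M_s(A', B') to the value of M_s(A, B).

    attn: (N, N) self-attention map with N = 4 * h * w.
    Assumption: M_s(X, Y) has rows indexed by X tokens, columns by Y tokens.
    """
    out = attn.copy()
    a, b = quadrant_indices("A", h, w), quadrant_indices("B", h, w)
    a2, b2 = quadrant_indices("A'", h, w), quadrant_indices("B'", h, w)
    out[np.ix_(a2, b2)] = attn[np.ix_(a, b)]
    return out


def cross_attention_masking(attn, h, w):
    """Zero the cross-attention of regions A, A', and B; leave B' untouched.

    attn: (N, T) cross-attention map (N image tokens, T text tokens).
    """
    out = attn.copy()
    for name in ("A", "A'", "B"):
        out[quadrant_indices(name, h, w)] = 0.0
    return out


# Toy example: each quadrant is 2x2 tokens, so the grid has 16 tokens.
h = w = 2
rng = np.random.default_rng(0)
self_attn = rng.random((16, 16))
cross_attn = rng.random((16, 5))  # 5 hypothetical text tokens

cloned = self_attention_cloning(self_attn, h, w)
masked = cross_attention_masking(cross_attn, h, w)
```

In a real diffusion model these edits would be applied to the attention maps inside the denoising network at inference time; which layers and timesteps are affected is not specified in the captions above.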