HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

Shengkai Zhang, Nianhong Jiao, Tian Li, Chaojie Yang, Chenhui Xue, Boya Niu, Jun Gao

Abstract

We propose an effective method for inserting adapters into text-to-image foundation models, which enables the execution of complex downstream tasks while preserving the generalization ability of the base model. The core idea of this method is to optimize the attention mechanism related to 2D feature maps, which enhances the performance of the adapter. This approach was validated on the task of meme video generation and achieved significant results. We hope this work can provide insights for post-training tasks of large text-to-image models. Additionally, as this method demonstrates good compatibility with SD1.5 derivative models, it holds certain value for the open-source community. Therefore, we will release the related code (https://songkey.github.io/hellomeme).

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Our solution consists of three modules. HMReferenceNet extracts Fidelity-Rich features from the reference image, while HMControlNet extracts high-level features such as head pose and facial expression information. HMDenoisingNet receives both sets of features and performs the core denoising function. It can also integrate a fine-tuned AnimateDiff module to generate continuous video frames.
  • Figure 2: This is the structural diagram of SKCrossAttention, which utilizes the Spatial Knitting Attention mechanism to fuse 2D feature maps with linear features. It performs cross-attention first row by row, then column by column.
  • Figure 3: This is the structural diagram of SKReferenceAttention, which uses the Spatial Knitting Attention mechanism to fuse two 2D feature maps. Specifically, the two feature maps are first concatenated row by row, followed by performing self-attention along the rows. Afterward, only the first half of each row is retained. A similar operation is then performed column by column.
  • Figure 4: Examples of self-reenactment performance comparisons, with five frames sampled from each video for illustration. The first row represents the ground truth, with the initial frame serving as the reference image (outlined in red dashed lines).
  • Figure 5: SD_EXP vs. SK_EXP
  • ...and 2 more figures
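
The row-then-column procedure in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustration of the caption's wording, not the released implementation. The function names `self_attention` and `sk_reference_attention` are hypothetical, the attention has no learned query/key/value projections or multiple heads, and reusing the same reference map in the column pass is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq):
    """Scaled dot-product self-attention without learned projections.

    seq has shape (batch, length, channels); each batch entry is
    attended independently of the others.
    """
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(seq.shape[-1])
    return softmax(scores, axis=-1) @ seq

def sk_reference_attention(x, ref):
    """Hypothetical sketch of the Figure 3 procedure.

    x, ref: feature maps of shape (H, W, C). Returns an (H, W, C) map.
    """
    h, w, _ = x.shape

    # Row pass: join the two maps along the width, so each of the H rows
    # becomes one sequence of length 2W, then attend within each row.
    rows = np.concatenate([x, ref], axis=1)        # (H, 2W, C)
    x = self_attention(rows)[:, :w]                # keep first half: (H, W, C)

    # Column pass (assumption: the same ref is reused): join along the
    # height, so each of the W columns becomes one sequence of length 2H.
    cols = np.concatenate([x, ref], axis=0)        # (2H, W, C)
    cols = cols.transpose(1, 0, 2)                 # (W, 2H, C), columns as batch
    x = self_attention(cols)[:, :h]                # keep first half: (W, H, C)
    return x.transpose(1, 0, 2)                    # back to (H, W, C)
```

In this sketch each attention call sees a sequence of length 2W or 2H rather than all 2·H·W positions at once, which is the practical difference between attending along one axis at a time and attending over the flattened maps.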