
FloAt: Flow Warping of Self-Attention for Clothing Animation Generation

Swasti Shreya Mishra, Kuldeep Kulkarni, Duygu Ceylan, Balaji Vasan Srinivasan

TL;DR

Well-established evaluation metrics such as LPIPS, SSIM, and CLIP scores, which are generally used to assess visual quality, are shown to be not necessarily suitable for capturing the subtle motions in human clothing animations.

Abstract

We propose a diffusion model-based approach, FloAtControlNet, to generate cinemagraphs composed of animations of human clothing. We focus on clothing such as dresses, skirts, and pants. The input to our model is a text prompt describing the type of clothing and its texture (for example leopard, striped, or plain), together with a sequence of normal maps that capture the underlying animation desired in the output. The backbone of our method is a normal-map-conditioned ControlNet, which is operated in a training-free regime. The key observation is that the underlying animation is embedded in the flow of the normal maps. We use the flow thus obtained to manipulate the self-attention maps of appropriate layers. Specifically, the self-attention map of a particular layer and frame is recomputed as a linear combination of itself and the self-attention map of the same layer at the previous frame, warped by the flow between the normal maps of the two frames. We show that manipulating the self-attention maps greatly enhances the quality of the clothing animation, making it look more natural and suppressing background artifacts. Through extensive experiments, we show that the proposed method outperforms all baselines qualitatively, in terms of both visual results and a user study. In particular, our method alleviates the background flickering present in the other diffusion model-based baselines that we consider. In addition, we show that our method outperforms all baselines in terms of RMSE and PSNR, computed between the input normal map sequences and the normal map sequences estimated from the output RGB frames. Further, we show that well-established evaluation metrics such as LPIPS, SSIM, and CLIP scores, which are generally used to assess visual quality, are not necessarily suitable for capturing the subtle motions in human clothing animations.
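
The normal-map-based RMSE and PSNR mentioned above can be illustrated with a minimal sketch. It assumes that both sequences are already available as arrays of shape (T, H, W, 3) with values in [0, 1]; the normal estimator applied to the output RGB frames, the value range, and the per-frame averaging are all assumptions here, not details taken from the paper. The function names are hypothetical.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; `max_val` is the assumed peak value."""
    err = rmse(a, b)
    if err == 0.0:
        return float("inf")  # identical inputs
    return float(20.0 * np.log10(max_val / err))

def sequence_scores(input_normals, estimated_normals, max_val=1.0):
    """Average per-frame RMSE and PSNR over two (T, H, W, 3) sequences.

    Averaging over frames is an assumption; the source does not state
    how the per-sequence score is aggregated.
    """
    rmses = [rmse(x, y) for x, y in zip(input_normals, estimated_normals)]
    psnrs = [psnr(x, y, max_val) for x, y in zip(input_normals, estimated_normals)]
    return float(np.mean(rmses)), float(np.mean(psnrs))
```

Because both scores compare geometry (normals) rather than appearance, they respond to motion that deviates from the input sequence, which is the property the perceptual metrics are reported to lack.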

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: We introduce a method for human clothing generation given a text prompt and a sequence of normal maps by manipulating the self-attention maps of normal-conditioned ControlNet using the flow information obtained from the normal maps. Our method is able to generate high-quality animation even for dresses with high-frequency textures such as stripes and tie-dye prints (see row 1). Note that the figure contains animations, which are best viewed in Acrobat Reader.
  • Figure 2: Overview of FloAtControlNet. Given a text prompt and an input sequence of normal maps, we first compute the flow over the sequence of normal maps using the RAFT model [teed2020raft] and threshold it to obtain a binary mask. We sequentially input the normal maps and text prompt into the normal-conditioned ControlNet [zhang2023adding]. Next, we perform self-attention feature injection, inspired by Pix2Video [ceylan2023pix2video], to ensure temporal consistency of the visual features of the generated sequence. Further, during the denoising process, we recompute the self-attention map for a particular frame as a linear combination of itself and the flow-warped corresponding self-attention map from the previous frame, to suppress spurious motions in the generation. Finally, to eliminate background flickering artifacts, we perform self-attention feature correction as stated in Equation \ref{eq:attn_cor}.
  • Figure 3: Self-attention visualisation. For a given FeatInControlNet generated sequence, we take the first PCA component of the self-attention map for frames 2, 3 and 4, at the last layer of the 3rd ConvUpBlock of the U-Net, and plot its heatmap at the final denoising step. We mark 3 spatial regions of the self-attention maps, depicted as dotted bounding boxes (1, 2) and an ellipse (3), along with their corresponding regions in the generated frames. In the first column, we show the flow maps computed on frames 2 to 3 (top left) and frames 3 to 4 (bottom left), with the same spatial regions marked. We observe that even in regions with zero flow (bounding boxes 1 and 2), the self-attention maps change noticeably, which results in undesirable motion in the generated sequence. This motivates the flow injection into the self-attention maps.
  • Figure 4: Qualitative results for our different methods and Rerender-A-Video Adapt [yang2023rerender]. Our approach is able to generate a temporally coherent sequence of frames by suppressing artifacts in the no-motion region, as demarcated by the dotted bounding boxes. Rerender-A-Video fails to suppress the spurious motion in the background.
  • Figure 5: Qualitative results for our different methods, CycleNet Reshading [bertiche2023blowing], and ControlVideo [zhang2023controlvideo]. Our approach is able to generate a temporally coherent sequence of frames by suppressing artifacts in the no-motion region, as demarcated by the dotted bounding boxes. CycleNet and ControlVideo fail to generate the intricate motion present in the input normal sequences.
  • ...and 1 more figure
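
The flow-guided self-attention update described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the flow is assumed to be precomputed (for example by RAFT) and already resized to the attention resolution, the warp is a nearest-neighbour approximation applied to both query and key positions, and the names `motion_mask`, `backward_index`, `blend_attention`, along with the `threshold` and `alpha` values, are hypothetical. How the paper applies the binary mask in its correction step is not reproduced here.

```python
import numpy as np

def motion_mask(flow, threshold=0.5):
    """Binary mask of positions whose flow magnitude exceeds `threshold`.

    `flow` has shape (H, W, 2) with (dx, dy) per position.
    The threshold value is a placeholder, not taken from the paper.
    """
    return np.linalg.norm(flow, axis=-1) > threshold

def backward_index(flow):
    """Flat index of the nearest source token in frame t-1 for each token in frame t.

    Approximates backward warping by reading the flow at the target
    position and clamping to the image border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs - flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys - flow[..., 1]), 0, h - 1).astype(int)
    return (src_y * w + src_x).reshape(-1)

def blend_attention(attn_cur, attn_prev, flow, alpha=0.5):
    """Linear combination of the current map and the flow-warped previous map.

    `attn_cur` and `attn_prev` have shape (H*W, H*W) with rows summing to 1.
    `alpha` weights the current frame; its value is an assumption.
    """
    idx = backward_index(flow)
    warped = attn_prev[np.ix_(idx, idx)]  # warp query rows and key columns
    mixed = alpha * attn_cur + (1.0 - alpha) * warped
    # Warping can break row normalisation, so renormalise each row.
    return mixed / mixed.sum(axis=-1, keepdims=True)
```

With zero flow the warp is the identity, so the update reduces to a plain average of the two maps weighted by `alpha`; this matches the intuition from Figure 3 that regions without flow should keep their attention pattern across frames.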