Animated Stickers: Bringing Stickers to Life with Video Diffusion

David Yan, Winnie Zhang, Luxin Zhang, Anmol Kalia, Dingkang Wang, Ankit Ramchandani, Miao Liu, Albert Pumarola, Edgar Schoenfeld, Elliot Blanchard, Krishna Narni, Yaqiao Luo, Lawrence Chen, Guan Pang, Ali Thabet, Peter Vajda, Amy Bearman, Licheng Yu

TL;DR

This work addresses animating static stickers by bridging the domain gap between natural videos and sticker-style motion. It proposes a spatiotemporal latent diffusion framework conditioned on an image $c_I$ and text $c_T$, augmented with temporal layers and an InstructPix2Pix (IP2P)-style conditioning scheme, and finetuned with an ensemble-of-teachers human-in-the-loop (HITL) pipeline that uses motion bucketing and middle-frame conditioning. Key contributions include the ensemble-of-teachers HITL approach, motion-aware data strategies, efficient architectures, and distillation techniques that reduce inference to eight solver steps, so that an eight-frame video with high motion quality is generated in under one second. The resulting system yields production-ready animated stickers with improved motion size, relevance, and looping behavior, enabling scalable deployment for social expression and potentially informing domain-adaptive video generation in other specialized visual domains.

Abstract

We introduce animated stickers, a video diffusion model which generates an animation conditioned on a text prompt and static sticker image. Our model is built on top of the state-of-the-art Emu text-to-image model, with the addition of temporal layers to model motion. Due to the domain gap, i.e. differences in visual and motion style, a model which performed well on generating natural videos can no longer generate vivid videos when applied to stickers. To bridge this gap, we employ a two-stage finetuning pipeline: first with weakly in-domain data, followed by a human-in-the-loop (HITL) strategy which we term ensemble-of-teachers. It distills the best qualities of multiple teachers into a smaller student model. We show that this strategy allows us to specifically target improvements to motion quality while maintaining the style from the static image. With inference optimizations, our model is able to generate an eight-frame video with high-quality, interesting, and relevant motion in under one second.
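
The data flow behind the ensemble-of-teachers idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function `collect_hitl_data`, the `Video` placeholder type, and the toy teachers and annotator are all hypothetical stand-ins for large video models and human raters. It shows the one step the abstract describes, pooling outputs from several teachers and keeping only the approved ones as finetuning data for a smaller student.

```python
from typing import Callable, Dict, List, Tuple

Video = str  # placeholder type (assumption); a real system would hold frames or latents


def collect_hitl_data(
    prompts: List[str],
    teachers: Dict[str, Callable[[str], Video]],
    is_acceptable: Callable[[str, Video], bool],
) -> List[Tuple[str, Video]]:
    """Pool teacher outputs per prompt and keep only the approved ones."""
    kept = []
    for prompt in prompts:
        for generate in teachers.values():
            video = generate(prompt)
            if is_acceptable(prompt, video):  # stand-in for a human annotator
                kept.append((prompt, video))
    return kept


# Toy stand-ins: each "teacher" just tags the prompt with its own name.
toy_teachers = {
    "teacher_a": lambda p: f"a:{p}",
    "teacher_b": lambda p: f"b:{p}",
}
# Toy "annotator": approves only teacher_b's outputs.
student_finetune_set = collect_hitl_data(
    ["cat running", "dog dancing"],
    toy_teachers,
    lambda prompt, video: video.startswith("b:"),
)
```

In this sketch the student never sees a teacher's rejected outputs, which is how filtering can target specific qualities such as motion.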

Paper Structure (23 sections, 1 equation, 7 figures, 2 tables)

This paper contains 23 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An example of the types of training data used, showing the domain gap between natural videos (a), short animations (b), and HITL-filtered in-domain videos (c).
  • Figure 2: Overall architecture of our animated stickers model (left), and addition of temporal layers to transformer and convolutional blocks (right). We employ a spatiotemporal latent diffusion model (LDM), where the UNet consists of convolutional stages and attention stages, and the attention stages perform both self-attention and cross-attention to text embeddings (CLIP is always used, FLAN-T5XL is optional depending on the architecture). Temporal layers are added after convolution and spatial transformers, with identity-initialization so that a newly initialized model can load T2I weights and reproduce the T2I model.
  • Figure 3: A mock-up of the annotation interface. To the left, annotators select any number out of the available videos, or select "I wouldn't share any of these image" if none of the videos are acceptable. To the right, annotators can see the caption, and auto-looped animated sticker videos.
  • Figure 4: Ensemble-of-teachers finetuning, where a number of pretrained, large general-purpose video models are finetuned using finetuning data and different recipes, which vary by data order and sampling framerate. This results in a set of "teacher" models, which are used to generate videos with the HITL prompt set. After human filtering, high-quality HITL data is used to finetune a set of small, efficient pretrained models and downselected into student model candidates.
  • Figure 5: Examples showing the effect of finetuning versus a general-purpose (out-of-domain) video model trained on natural videos. In-domain and HITL finetuning have the effect of a) increasing secondary motion (e.g. in faces, background objects, etc.), b) giving the subject a relevant animation rather than adding bulk motion, and c) reducing motion artifacts and morphing. Top: the general-purpose model gives the cat an up-and-down bobbing motion, whereas the finetuned model animates a correct running movement. Bottom: the general-purpose model adds morphing to the video, whereas the finetuned model correctly animates dancing.
  • ...and 2 more figures
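
The identity-initialization mentioned in the Figure 2 caption can be illustrated with a toy sketch. The caption does not say how the temporal layers are identity-initialized; the version below assumes a residual branch whose output projection starts at zero, which is one common way to achieve this. The class name `TemporalMixLayer` and its parameters are hypothetical, and plain Python lists stand in for tensors.

```python
import random


class TemporalMixLayer:
    """Toy residual temporal layer over a list of per-frame feature vectors."""

    def __init__(self, num_frames, channels, seed=0):
        rng = random.Random(seed)
        # Temporal mixing weights, randomly initialized, shape [T][T].
        self.mix = [[rng.gauss(0.0, 0.1) for _ in range(num_frames)]
                    for _ in range(num_frames)]
        # Output projection, zero-initialized (assumption), shape [C][C].
        self.proj = [[0.0] * channels for _ in range(channels)]

    def forward(self, frames):
        # frames: list of T vectors, each with C floats.
        num_frames, channels = len(frames), len(frames[0])
        out = []
        for t in range(num_frames):
            # Mix information across frames for every channel.
            mixed = [sum(self.mix[t][s] * frames[s][c] for s in range(num_frames))
                     for c in range(channels)]
            # Project the mixed features; this is all zeros at initialization.
            branch = [sum(self.proj[c][k] * mixed[k] for k in range(channels))
                      for c in range(channels)]
            # Residual connection: input plus the temporal branch.
            out.append([frames[t][c] + branch[c] for c in range(channels)])
        return out


frames = [[float(t + c) for c in range(3)] for t in range(8)]
layer = TemporalMixLayer(num_frames=8, channels=3)
unchanged = layer.forward(frames) == frames  # the layer starts as an identity map
```

Because the branch contributes nothing at initialization, each frame passes through exactly as the per-frame (image) layers produced it. Training can then move the projection away from zero to introduce motion.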