ILDiff: Generate Transparent Animated Stickers by Implicit Layout Distillation

Ting Zhang, Zhiqiang Yuan, Yeshuang Zhu, Jinchao Zhang

TL;DR

This work tackles the generation of animated stickers with high-quality transparent channels, a task where existing video matting methods struggle with semi-open regions and diffusion-based methods suffer from temporal flicker. It introduces ILDiff, which combines implicit layout distillation of SAM features with a temporal modeling branch to produce layout-aware, temporally coherent alpha channels within a latent diffusion framework. A new Transparent Animated Sticker Dataset (TASD) with 0.32M samples, together with a 200-sample test set (TASD-T), supports evaluation and future research. Empirical results show that ILDiff delivers finer and smoother transparent channels than strong baselines such as Matting Anything and Layer Diffusion, and ablations highlight the importance of the temporal depth in the layout adapter. The authors state that the code and TASD will be released.

Abstract

High-quality animated stickers usually contain transparent channels, which current video generation models often ignore. Existing methods for generating fine-grained animated transparent channels can be roughly divided into video matting algorithms and diffusion-based algorithms. Video matting methods perform poorly on semi-open areas in stickers, while diffusion-based methods are usually designed to model a single image, which leads to local flicker when they are applied to animated stickers. In this paper, we first propose ILDiff, a method that generates animated transparent channels through implicit layout distillation and solves two problems of existing methods: semi-open area collapse and the lack of temporal modeling. Second, we create the Transparent Animated Sticker Dataset (TASD), which contains 0.32M high-quality samples with transparent channels, to provide data support for related fields. Extensive experiments demonstrate that ILDiff produces finer and smoother transparent channels than other methods such as Matting Anything and Layer Diffusion. Our code and dataset will be released at https://xiaoyuan1996.github.io.
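
To make the two notions in the abstract concrete, the sketch below shows how a transparent (alpha) channel composites a sticker pixel over a background, and one simple way to score local flicker in an alpha sequence. It is only an illustration under stated assumptions: the function names `composite_over` and `alpha_flicker` are hypothetical, and the flicker score (mean absolute frame-to-frame alpha change) is an assumed proxy, not the evaluation metric used in the paper.

```python
def composite_over(rgb, alpha, background):
    """Composite one RGB pixel over a background colour.

    rgb, background: 3-tuples of floats in [0, 1]; alpha: float in [0, 1].
    Standard "over" operator: out = alpha * fg + (1 - alpha) * bg.
    """
    return tuple(alpha * c + (1.0 - alpha) * b for c, b in zip(rgb, background))


def alpha_flicker(alpha_frames):
    """Mean absolute change of alpha between consecutive frames.

    alpha_frames: list of frames, each a list of rows of floats in [0, 1].
    This is an assumed, illustrative flicker proxy (not the paper's metric):
    a temporally stable alpha sequence scores close to 0.
    """
    if len(alpha_frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(alpha_frames, alpha_frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a_prev, a_curr in zip(row_prev, row_curr):
                total += abs(a_curr - a_prev)
                count += 1
    return total / count if count else 0.0


# Toy 1x2 alpha sequences: one stable, one where a pixel toggles every frame.
steady = [[[1.0, 0.0]], [[1.0, 0.0]]]
flickery = [[[1.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]]]
```

On these toy inputs the stable sequence scores lower than the toggling one, which is the behaviour a smoothness comparison would reward.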

Paper Structure

This paper contains 10 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Framework of the proposed ILDiff model. Compared with Layer Diffusion, ILDiff adds a layout adapter that learns the implicit layout information in animated stickers by distilling SAM, and a temporal modeling branch that mitigates the local flickering encountered by traditional diffusion-based methods. During training, a loss committee consisting of $\mathcal{L}_g$, $\mathcal{L}_{rgb}$, and $\mathcal{L}_p$ jointly optimizes the model.
  • Figure 2: Two samples from TASD, in which the GIFs are split into frames for visualization. The trigger word is shown in red. See https://xiaoyuan1996.github.io for animated samples.
  • Figure 3: Visual analysis of TASD. (a) Frequency count of top 15 trigger words. (b) Statistics of frame number.
  • Figure 4: Manual comparison of the transparent channels generated by different methods in terms of (a) frame smoothness and (b) hole residue.
  • Figure 5: Visual comparison for transparent channel generation between Layer Diffusion and Ours. See https://xiaoyuan1996.github.io for more results.
  • ...and 1 more figure