VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation

Zhiqiang Yuan, Jiapei Zhang, Ying Deng, Yeshuang Zhu, Jie Zhou, Jinchao Zhang

TL;DR

This work tackles animated sticker generation (ASG) by introducing VSD2M, the largest vision-language sticker dataset to date, which covers both static and animated stickers to address data scarcity. It also proposes the Spatial Temporal Interaction (STI) layer to better exploit spatial-temporal information in discrete, low-frame-rate stickers, enabling more faithful and semantically aligned generation. The authors benchmark several transformer- and diffusion-based video generation methods on VSD2M, showing that STI improves VQA and FVD, and provide baselines and analyses as a foundation for ASG research. By releasing a two-million-scale, richly labeled dataset and a specialized generation module, this work offers infrastructure for advancing intelligent creation of animated stickers in real-world, cross-modal settings.

Abstract

As a common form of communication on social media, stickers are popular with internet users because they convey emotions in a vivid, cute, and interesting way. People prefer to retrieve an appropriate sticker rather than create one, because creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Advanced text-to-video algorithms have recently spawned numerous general video generation systems that let users customize high-quality, photo-realistic videos from simple text prompts. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than videos, is greatly hindered by difficulties in data acquisition and by incomplete benchmarks. To facilitate research in the animated sticker generation (ASG) field, we first construct VSD2M, currently the largest vision-language sticker dataset, which contains static and animated stickers at a two-million scale. Second, to improve the performance of traditional video generation methods on ASG tasks with discrete characteristics, we propose a Spatial Temporal Interaction (STI) layer that uses semantic interaction and detail preservation to address insufficient information utilization. Moreover, we train baselines with several video generation methods (e.g., transformer-based and diffusion-based methods) on VSD2M and conduct a detailed analysis to establish systematic supervision for the ASG task. To the best of our knowledge, this is the most comprehensive large-scale benchmark for multi-frame animated sticker generation, and we hope this work can provide valuable inspiration for other scholars in intelligent creation.

Paper Structure

This paper contains 18 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of data collection and processing, which can be divided into four stages: web crawling, data filtering, annotation and dataset splitting. During the data annotation process, we use manually labeled data to fine-tune different models to obtain high-quality semi-automatic annotation results.
  • Figure 2: Two samples from VSD2M, in which the GIFs are split into frames for visualization. The description in blue shows part of the action in the GIFs.
  • Figure 3: Word cloud of the descriptions in VSD2M, which contain words that reflect the motion in GIFs, such as "movement", "open", etc.
  • Figure 4: Visual analysis of VSD2M. (a) Frequency count of top 25 trigger words. (b) Statistics of frame number, note that we only count multi-frame animated stickers. (c) Frequency count of top 35 words in descriptions. (d) Statistics of caption length.
  • Figure 5: Visual comparison for animated sticker generation between VideoLDM, VideoFactory, I2VGen-XL and ours. The text prompts are as follows, left: "A cute rabbit setting off firecrackers", middle: "A little bear waving his hands up and down", right: "A cartoon little fox waving with a heart above his head". More results can be seen in https://xiaoyuan1996.github.io/files/VSD2M/index.html.
  • ...and 4 more figures
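
The statistics summarized in the Figure 4 caption (top trigger words, frame numbers of multi-frame stickers, top description words, caption lengths) could be computed along the following lines. This is only a minimal sketch: the record fields `trigger`, `description`, and `frames`, and the three toy samples, are assumptions made for illustration and do not reflect the actual VSD2M schema or data.

```python
from collections import Counter

# Hypothetical records. The field names and values are illustrative
# assumptions, not the real VSD2M annotation format.
samples = [
    {"trigger": "hello", "description": "a cute rabbit waving its hand", "frames": 8},
    {"trigger": "thanks", "description": "a little bear bowing", "frames": 1},
    {"trigger": "hello", "description": "a cartoon fox waving with a heart above its head", "frames": 12},
]


def top_k(items, k):
    """Return the k most frequent items together with their counts."""
    return Counter(items).most_common(k)


# (a) Frequency count of the top 25 trigger words.
trigger_counts = top_k((s["trigger"] for s in samples), 25)

# (b) Frame numbers, counting only multi-frame (animated) stickers.
frame_counts = [s["frames"] for s in samples if s["frames"] > 1]

# (c) Frequency count of the top 35 words in descriptions
#     (naive whitespace tokenization).
description_words = top_k(
    (w for s in samples for w in s["description"].lower().split()), 35
)

# (d) Caption length, measured here in whitespace-separated words.
caption_lengths = [len(s["description"].split()) for s in samples]
```

Whitespace tokenization is the simplest possible choice here; a real analysis would likely use a proper tokenizer and a stop-word list so that function words such as "a" do not dominate the description counts.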