GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang

TL;DR

This work formalizes the problem of visual sticker semantic similarity and introduces the Triple-S benchmark as the first human-annotated dataset for this task, alongside the General Sticker Encoder (GSE) as a lightweight, general-purpose sticker representation model. By combining Triple-S with a large-scale MultiChat dataset and fine-tuning a CLIP backbone via contrastive learning, GSE learns embeddings that better capture sticker-level semantics and generalize to unseen stickers. Experiments show that standard image encoders perform poorly on nuanced sticker semantics, while GSE delivers robust generalization (e.g., on WXChallenge) and strong transfer to downstream tasks such as emotion classification and sticker-to-sticker retrieval, often achieving state-of-the-art results when integrated into larger systems. Overall, Triple-S provides a rigorous evaluation resource and GSE offers practical, transferable embeddings for sticker understanding, retrieval, and generation with lightweight compute needs.
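The summary above mentions contrastive fine-tuning but does not give the loss. As a rough illustration only, the sketch below shows one common pairwise form (a cosine-embedding style loss) applied to two sticker embeddings. The function names, the `margin` value, and the toy vectors are assumptions made for illustration, not the authors' training objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_contrastive_loss(u, v, is_positive, margin=0.2):
    """Hypothetical pairwise loss: pull positive pairs together and
    push negative pairs apart until their similarity is below `margin`."""
    sim = cosine(u, v)
    if is_positive:
        # Positive pair: loss shrinks to 0 as similarity approaches 1.
        return 1.0 - sim
    # Negative pair: penalized only while similarity exceeds the margin.
    return max(0.0, sim - margin)


# Toy embeddings (assumed values, not real sticker features).
same_meaning = pair_contrastive_loss([1.0, 0.0], [1.0, 0.0], is_positive=True)
different_meaning = pair_contrastive_loss([1.0, 0.0], [0.0, 1.0], is_positive=False)
```

Under this formulation, a negative pair that already sits below the margin contributes nothing, so training effort concentrates on visually similar stickers with different meanings.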

Abstract

Stickers have become a popular form of visual communication, yet understanding their semantic relationships remains challenging due to their highly diverse and symbolic content. In this work, we formally define the Sticker Semantic Similarity task and introduce Triple-S, the first benchmark for this task, consisting of 905 human-annotated positive and negative sticker pairs. Through extensive evaluation, we show that existing pretrained vision and multimodal models struggle to capture nuanced sticker semantics. To address this, we propose the General Sticker Encoder (GSE), a lightweight and versatile model that learns robust sticker embeddings using both Triple-S and additional datasets. GSE achieves superior performance on unseen stickers and demonstrates strong results on downstream tasks such as emotion classification and sticker-to-sticker retrieval. By releasing both Triple-S and GSE, we provide standardized evaluation tools and robust embeddings, enabling future research in sticker understanding, retrieval, and multimodal content generation. The Triple-S benchmark and GSE have been publicly released.
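To make the task concrete, here is a minimal sketch of how an encoder could be scored on labeled sticker pairs: embed both stickers, compare them with cosine similarity, and predict "similar" above a threshold. The threshold of 0.5, the accuracy metric, and the toy embeddings are assumptions for illustration; the abstract does not specify the benchmark's actual evaluation protocol.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_accuracy(pairs, threshold=0.5):
    """Fraction of pairs classified correctly.

    `pairs` is a list of (embedding_a, embedding_b, label) tuples,
    where label is 1 for a positive pair and 0 for a negative pair.
    """
    correct = 0
    for emb_a, emb_b, label in pairs:
        predicted = 1 if cosine_similarity(emb_a, emb_b) >= threshold else 0
        correct += int(predicted == label)
    return correct / len(pairs)


# Toy embeddings standing in for encoder outputs (assumed values).
toy_pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),  # similar meaning, similar vectors
    ([1.0, 0.0], [0.0, 1.0], 0),  # different meaning, orthogonal vectors
    ([0.5, 0.5], [1.0, 0.0], 1),  # similar meaning, moderately close vectors
    ([1.0, 0.2], [0.9, 0.3], 0),  # look-alike stickers with different meaning
]

accuracy = pair_accuracy(toy_pairs)
```

The last toy pair mimics the hard case the abstract describes: two stickers that look alike but mean different things, which an encoder driven by visual features tends to score as similar.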

Paper Structure

This paper contains 34 sections, 3 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Examples of semantic pairings. The top row shows positive pairs, where both stickers convey similar emotions or actions. The bottom row shows negative pairs, where stickers differ in emotion, expression, or context despite visual or thematic similarities.