Emotion and Intention Guided Multi-Modal Learning for Sticker Response Selection
Yuxuan Hu, Jian Chen, Yuhao Wang, Zixuan Li, Jing Xiong, Pengyue Jia, Wei Wang, Chengming Li, Xiangyu Zhao
TL;DR
This work addresses Sticker Response Selection (SRS) by jointly modeling emotional and intentional cues across visual and textual modalities. It introduces the Emotion and Intention Guided Multi-Modal Learning (EIGML) framework, featuring a Dual-Level Contrastive Framework (DLCF) for inter- and intra-modality alignment and a three-part fusion module (Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and Similarity-Adjusted Matching Mechanism) to progressively infuse emotional and intentional information. Extensive experiments on StickerChat and DSTC10-MOD show consistent improvements over state-of-the-art baselines, highlighting the importance of integrating emotion and intention for accurate sticker selection. The proposed approach achieves higher accuracy with lower computational cost, demonstrating practical impact for real-time multi-modal dialogue systems and sticker-based communication tasks.
Abstract
Stickers are widely used in online communication to convey emotions and implicit intentions. The Sticker Response Selection (SRS) task aims to select the most contextually appropriate sticker based on the dialogue. However, existing methods typically rely on semantic matching and model emotional and intentional cues separately, which can lead to mismatches when emotions and intentions are misaligned. To address this issue, we propose Emotion and Intention Guided Multi-Modal Learning (EIGML). This framework is the first to jointly model emotion and intention, effectively reducing the bias caused by isolated modeling and significantly improving selection accuracy. Specifically, we introduce a Dual-Level Contrastive Framework to perform both intra-modality and inter-modality alignment, ensuring consistent representation of emotional and intentional features within and across modalities. In addition, we design an Intention-Emotion Guided Multi-Modal Fusion module that integrates emotional and intentional information progressively through three components: Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and a Similarity-Adjusted Matching Mechanism. This design injects rich, effective information into the model and enables a deeper understanding of the dialogue, ultimately enhancing sticker selection performance. Experimental results on two public SRS datasets show that EIGML consistently outperforms state-of-the-art baselines, achieving higher accuracy and a better understanding of emotional and intentional features. Code is provided in the supplementary materials.
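As a rough illustration of what intra-modality and inter-modality alignment can look like, the sketch below combines one cross-modal contrastive term with two within-modality terms. It is not the EIGML implementation: the abstract does not state the loss, so a symmetric InfoNCE objective is assumed here, and the names `info_nce`, `dual_level_loss`, `weight_intra`, the temperature value, and the use of a second "view" of each embedding as the intra-modality positive are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchors, positives, temperature=0.07):
    # Symmetric InfoNCE (assumed loss): row i of `anchors` should match
    # row i of `positives`; all other rows in the batch act as negatives.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def dual_level_loss(text_emb, sticker_emb, text_view, sticker_view, weight_intra=1.0):
    # Inter-modality term: pull each dialogue embedding toward its sticker.
    inter = info_nce(text_emb, sticker_emb)
    # Intra-modality terms (hypothetical setup): pull each embedding toward
    # a second view of the same item within its own modality.
    intra = info_nce(text_emb, text_view) + info_nce(sticker_emb, sticker_view)
    return inter + weight_intra * intra
```

In this sketch, correctly paired batches yield a lower loss than mismatched ones, which is the property any alignment objective of this kind relies on. How the emotional and intentional features enter the positive and negative pairs in EIGML is specified in the paper itself, not here.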
