Emotion and Intention Guided Multi-Modal Learning for Sticker Response Selection

Yuxuan Hu, Jian Chen, Yuhao Wang, Zixuan Li, Jing Xiong, Pengyue Jia, Wei Wang, Chengming Li, Xiangyu Zhao

TL;DR

This work addresses Sticker Response Selection (SRS) by jointly modeling emotional and intentional cues across visual and textual modalities. It introduces the Emotion and Intention Guided Multi-Modal Learning (EIGML) framework, featuring a Dual-Level Contrastive Framework (DLCF) for inter- and intra-modality alignment and a three-part fusion module (Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and Similarity-Adjusted Matching Mechanism) to progressively infuse emotional and intentional information. Extensive experiments on StickerChat and DSTC10-MOD show consistent improvements over state-of-the-art baselines, highlighting the importance of integrating emotion and intention for accurate sticker selection. The proposed approach achieves higher accuracy with lower computational cost, demonstrating practical impact for real-time multi-modal dialogue systems and sticker-based communication tasks.

Abstract

Stickers are widely used in online communication to convey emotions and implicit intentions. The Sticker Response Selection (SRS) task aims to select the most contextually appropriate sticker based on the dialogue. However, existing methods typically rely on semantic matching and model emotional and intentional cues separately, which can lead to mismatches when emotions and intentions are misaligned. To address this issue, we propose Emotion and Intention Guided Multi-Modal Learning (EIGML). This framework is the first to jointly model emotion and intention, reducing the bias caused by isolated modeling and improving selection accuracy. Specifically, we introduce a Dual-Level Contrastive Framework that performs both intra-modality and inter-modality alignment, ensuring consistent representation of emotional and intentional features within and across modalities. In addition, we design an Intention-Emotion Guided Multi-Modal Fusion module that integrates emotional and intentional information progressively through three components: Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and a Similarity-Adjusted Matching Mechanism. This design injects rich, effective information into the model and enables a deeper understanding of the dialogue, ultimately enhancing sticker selection performance. Experimental results on two public SRS datasets show that EIGML consistently outperforms state-of-the-art baselines, achieving higher accuracy and a better understanding of emotional and intentional features. Code is provided in the supplementary materials.
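The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of what inter-modality alignment between dialogue and sticker embeddings could look like, assuming a generic InfoNCE loss with in-batch negatives. The function names (`cosine`, `info_nce`), the temperature value, and the toy embeddings are all illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (generic form, assumed here).

    anchors[i] is expected to match positives[i]; every other entry of
    `positives` acts as an in-batch negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical 2-d embeddings: two dialogues and their candidate stickers.
dialogue_emb = [[1.0, 0.0], [0.0, 1.0]]
sticker_aligned = [[0.9, 0.1], [0.1, 0.9]]   # sticker i matches dialogue i
sticker_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(dialogue_emb, sticker_aligned)
loss_swapped = info_nce(dialogue_emb, sticker_swapped)
print(loss_aligned < loss_swapped)
```

Under this assumption, the same function could serve an intra-modality term by passing two views of the same modality as `anchors` and `positives`; how the paper actually combines the two levels is not stated in the abstract.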

Paper Structure

This paper contains 29 sections, 19 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Examples for potential failures in sticker selection when relying solely on emotional or intentional cues.
  • Figure 2: Overview of our proposed approach. Dual-Level Contrastive Framework enhances intra-modality and inter-modality alignment, while Intention-Emotion Guided Multi-Modal Fusion integrates three submodules: (1) Intention-Emotion Attention Fusion, (2) Emotion-Guided Intention Selection, and (3) Similarity-Adjusted Matching Mechanism.
  • Figure 3: Matching scores and Inter/Intra Distance Ratios on StickerChat. Higher ratios reflect better class separation in the t-SNE space, while higher matching scores indicate stronger alignment between dialogues and correct stickers.
  • Figure 4: t-SNE visualization of 500 StickerChat examples using different methods. For each test instance, one ground-truth and one randomly sampled negative image (from nine distractors) are shown.
  • Figure 5: Hyper-parameter analysis on $w_{e}$ and $w_{i}$. When $w_{e}$ is varied, $w_{i}$ is fixed at 0.5, and vice versa.
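
The caption of Figure 3 refers to an Inter/Intra Distance Ratio without defining it. One plausible reading, sketched below purely as an assumption, is the mean pairwise distance between points of different classes divided by the mean pairwise distance between points of the same class, computed on the 2-d t-SNE coordinates. The function name `inter_intra_ratio` and the toy points are hypothetical.

```python
import math
from itertools import combinations


def inter_intra_ratio(points, labels):
    """Mean inter-class distance divided by mean intra-class distance.

    Assumed definition: a ratio above 1 means points of different classes
    sit farther apart, on average, than points of the same class.
    """
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        distance = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        (intra if label_p == label_q else inter).append(distance)
    if not intra or not inter:
        raise ValueError("need at least one same-class and one cross-class pair")
    return (sum(inter) / len(inter)) / (sum(intra) / len(intra))


# Hypothetical 2-d coordinates: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = inter_intra_ratio(points, [0, 0, 1, 1])
poorly_separated = inter_intra_ratio(points, [0, 1, 0, 1])
print(well_separated > poorly_separated)
```

With labels that follow the clusters the ratio is well above 1, and with labels that cut across the clusters it falls below 1, which matches the caption's statement that higher ratios reflect better class separation.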