STICKERCONV: Generating Multimodal Empathetic Responses from Scratch

Yiqun Zhang; Fanheng Kong; Peidong Wang; Shuang Sun; Lingshuai Wang; Shi Feng; Daling Wang; Yifei Zhang; Kaisong Song

TL;DR

STICKERCONV tackles the scarcity of multimodal empathetic dialogue data by introducing Agent4SC, a multi-agent LLM system that utilizes stickers to simulate realistic conversations, and by releasing the STICKERCONV dataset. Building on this, PEGS provides an end-to-end framework that perceives multimodal input and generates contextually appropriate text and stickers, with retrieval and generation strategies enhancing expressiveness. The paper introduces robust LLM-based and human-centric evaluation metrics to assess empathy, consistency, and modality synergy, demonstrating PEGS's superiority over baselines in both textual and multimodal outputs. This work advances multimodal empathetic dialogue by offering a scalable data source, a unified perceptual-generation framework, and comprehensive evaluation protocols to enable more engaging human-AI conversations.

Abstract

Stickers are widely recognized for enhancing empathetic communication in online interactions, yet they remain underexplored in current empathetic dialogue research, largely because comprehensive datasets are lacking. In this paper, we introduce the Agent for STICKERCONV (Agent4SC), which uses collaborative agent interactions to realistically simulate human sticker usage, thereby enhancing multimodal empathetic communication. Building on this foundation, we develop a multimodal empathetic dialogue dataset, STICKERCONV, comprising 12.9K dialogue sessions, 5.8K unique stickers, and 2K diverse conversational scenarios. This dataset serves as a benchmark for multimodal empathetic generation. We further propose PErceive and Generate Stickers (PEGS), a multimodal empathetic response generation framework, complemented by a comprehensive set of LLM-based empathy evaluation metrics. Our experiments demonstrate PEGS's effectiveness in generating contextually relevant and emotionally resonant multimodal empathetic responses, contributing to more nuanced and engaging empathetic dialogue systems.

Paper Structure (62 sections, 9 equations, 24 figures, 10 tables)

Introduction
Related Work
Empathetic Response Generation
Large Multimodal Models
LLM-Based Agents
Agent for STICKERCONV
Profile Module
Tool Module
Memory Module
Plan Module
Action Module
Manager Agent
The STICKERCONV Dataset
PEGS
Multimodal Input Perception
...and 47 more sections

Figures (24)

Figure 1: An example of multimodal conversation in STICKERCONV. Both parties can use stickers to express their emotions, which enhances interactivity and expression. The assistant can empathize with the user according to the conversation (green text).
Figure 2: The overview of Agent4SC. Memory and Plan modules enable the agent to mimic human observation and thought, overcoming LLMs' inability to grasp nuanced emotions. The Action module supports generating insights with human-like emotional reactions. The Profile module gives each agent distinct reflections and actions. Furthermore, Agent4SC uses stickers as a Tool for more natural conversation, allowing the agent to choose stickers like humans. These modules streamline observation, reflection, and action, while the Manager Agent maintains performance and quality.
Figure 3: The architecture of PEGS framework includes various routing options, distinguished by colored connecting lines. Input stickers undergo joint encoding by an image encoder, Q-Former, and a linear layer, with Vicuna serving as the language model. The output of the LLM activates two sets of tokens differently across model versions: one for image retrieval and the other as a textual condition. Subsequently, the frozen image decoder generates images.
Figure 4: The chart of emotional distribution in the choice of stickers between the User and the System.
Figure 5: Emotion distribution of user profiles in Agent for STICKERCONV.
...and 19 more figures
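
The Figure 3 caption describes the sticker input path as an image encoder, followed by a Q-Former, followed by a linear layer that feeds the language model. The toy sketch below illustrates only the general shape of that path, under loud assumptions: the Q-Former is reduced to a single cross-attention step with a fixed set of query vectors, all dimensions and random weights are invented for illustration, and the function name `encode_sticker` is hypothetical. It is not the PEGS implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def encode_sticker(patch_features, queries, w_proj):
    """Hypothetical stand-in for the sticker input path.

    patch_features: num_patches x d   (assumed image-encoder output)
    queries:        num_queries x d   (assumed learned query vectors)
    w_proj:         d x d_llm         (assumed linear layer into the LLM space)
    """
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*patch_features)]  # d x num_patches
    scores = matmul(queries, keys_t)                      # num_queries x num_patches
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    pooled = matmul(attn, patch_features)                 # num_queries x d
    return matmul(pooled, w_proj)                         # num_queries x d_llm


# Illustrative sizes only; none of these numbers come from the paper.
rng = random.Random(0)
num_patches, d_vis, num_queries, d_llm = 16, 8, 4, 12
patch_features = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(num_patches)]
queries = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(num_queries)]
w_proj = [[rng.gauss(0, 1) for _ in range(d_llm)] for _ in range(d_vis)]

sticker_tokens = encode_sticker(patch_features, queries, w_proj)
print(len(sticker_tokens), len(sticker_tokens[0]))
```

The point of the sketch is the shape contract: however many patches the image encoder emits, the language model receives a fixed number of sticker tokens (one per query), each already projected to the model's embedding width.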
