Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
Kilichbek Haydarov, Xiaoqian Shen, Avinash Madasu, Mahmoud Salem, Li-Jia Li, Gamaleldin Elsayed, Mohamed Elhoseiny
TL;DR
This work studies how emotions emerge in visually grounded conversations and introduces AffectVisDial, a large-scale dataset of 50K dialogues with emotion attributions and explanations grounded in WikiArt visuals. It defines three subtasks (dialog-based question answering, dialog-based emotion prediction, and affective explanation generation) and benchmarks a range of neural baselines, including discriminative VisDial variants and multimodal and LLM-based models. Results show that discriminative models excel at question answering, while dialog content and image descriptions improve emotion explanations. Zero-shot LLMs benefit from dialog context, but fine-tuned models produce stronger, more emotionally aligned explanations, and human studies judge the outputs to be reasonable and human-like. The dataset is a resource for emotion-aware AI in vision-language settings and suggests directions such as emotionally guided dialogue and collaborative image editing.
Abstract
We introduce Affective Visual Dialog, an emotion explanation and reasoning task that serves as a testbed for research on understanding the formation of emotions in visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering, (2) Dialog-based Emotion Prediction, and (3) Affective Explanation Generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs together with concluding emotion attributions and dialog-informed textual emotion explanations; collecting it required a total of 27,180 working hours. We explain our design decisions in collecting the dataset and introduce the questioner and answerer tasks associated with the participants in the conversation. We train and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. Our project page is available at https://affective-visual-dialog.github.io.
