Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations

Kilichbek Haydarov, Xiaoqian Shen, Avinash Madasu, Mahmoud Salem, Li-Jia Li, Gamaleldin Elsayed, Mohamed Elhoseiny

TL;DR

This work studies how emotions emerge in visually grounded conversations by introducing AffectVisDial, a large-scale dataset of 50K dialogues with emotion attributions and explanations grounded in WikiArt images. It defines three subtasks (dialog-based question answering, dialog-based emotion prediction, and affective explanation generation) and benchmarks a range of neural baselines, including discriminative VisDial variants and multimodal and LLM-based models. Results show that discriminative models excel at question answering, while dialog content and image descriptions improve emotion explanations. Zero-shot LLMs benefit from dialog context, but fine-tuned models produce stronger, more emotionally aligned explanations, and human studies judge the outputs reasonable and human-like. The dataset is a resource for emotion-aware AI in vision-language settings and points to applications such as emotionally guided dialogue and collaborative image editing.

Abstract

We introduce Affective Visual Dialog, an emotion explanation and reasoning task, as a testbed for research on understanding the formation of emotions in visually grounded conversations. The task involves three skills: (1) Dialog-based Question Answering, (2) Dialog-based Emotion Prediction, and (3) Affective emotion explanation generation based on the dialog. Our key contribution is the collection of a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs as well as concluding emotion attributions and dialog-informed textual emotion explanations, which required a total of 27,180 working hours to collect. We explain our design decisions in collecting the dataset and introduce the questioner and answerer tasks that are associated with the participants in the conversation. We train and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. Our project page is available at https://affective-visual-dialog.github.io.
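
To make the dataset description concrete, the following is a minimal sketch of how one dialog record might be represented. It assumes only what the abstract and figure captions state: 10 question-answer turns per dialog, plus a concluding emotion attribution and textual explanation from both the Questioner and the Answerer. All class names, field names, and the `dialog_context` helper are hypothetical and are not taken from the released dataset or its code.

```python
from dataclasses import dataclass
from typing import List

# The abstract states that every dialog has 10 turns.
TURNS_PER_DIALOG = 10


@dataclass
class Turn:
    """One question-answer exchange (hypothetical field names)."""
    question: str
    answer: str


@dataclass
class EmotionAttribution:
    """A concluding emotion label and its textual explanation."""
    emotion: str       # illustrative label string; the real label set is not assumed here
    explanation: str


@dataclass
class AffectiveDialog:
    """One dialog record: an image reference, 10 turns, and both participants' attributions."""
    image_id: str                     # hypothetical identifier for the artwork
    turns: List[Turn]
    questioner: EmotionAttribution    # Questioner does not see the image during the dialog
    answerer: EmotionAttribution      # Answerer sees the image

    def __post_init__(self) -> None:
        # Enforce the 10-turn structure described in the abstract.
        if len(self.turns) != TURNS_PER_DIALOG:
            raise ValueError(
                f"expected {TURNS_PER_DIALOG} turns, got {len(self.turns)}"
            )


def dialog_context(dialog: AffectiveDialog) -> str:
    """Flatten the turns into a single text string, one plausible input
    format for a dialog-based emotion prediction or explanation model."""
    return " ".join(f"Q: {t.question} A: {t.answer}" for t in dialog.turns)
```

In this sketch, the flattened string from `dialog_context` stands in for the dialog input shared by the emotion prediction and explanation subtasks; an actual baseline may encode the dialog differently.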

Paper Structure (12 sections, 10 figures, 6 tables)

Figures (10)

  • Figure 1: AffectVisDial captures constructed emotion attributions and explanations from both the Questioner (without image access) and Answerer (with image access) after 10 turns of questions and answers starting from two opposing opinions. Subsequently, the Questioner views the image and can alter their initial emotional response, accompanied by a corresponding textual explanation.
  • Figure 2: Visual stimuli from diverse movements including photography; dialog counts and percentages are in parentheses.
  • Figure 3: An example from the AffectVisDial dataset. It contains an image, two opinions (positive and negative) about the image, a conversation, and explanations from both Questioner and Answerer.
  • Figure 4: Distribution of first n-grams for AffectVisDial (a) questions, (b) answers. (c) Questioner's emotion distribution before observing the hidden image, along with the most affective words for each specific emotion from the dialogue.
  • Figure 5: Distribution of lengths for questions, answers, and explanations.
  • ...and 5 more figures