ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report

Yixiao Yuan, Yingzhe Peng

TL;DR

The paper tackles emotion explanation generation in visual-dialog contexts for art with two complementary multimodal pipelines: an LM-based pipeline (captioning via BLIP2, followed by Bart-Large fine-tuning) and an LVLM-based pipeline (llava-v1.5-7b with instruction tuning). A 5-fold cross-validation framework and ensemble techniques improve emotion classification, while the LVLM contributes richer explanations. A hybrid strategy that takes the emotion label from hard voting and the explanation from the LVLM achieves the top ICCV23 challenge result (F1 52.36, BLEU 0.26, total 26.31). The results suggest that combining language and vision models helps interpret and explain human emotional responses to visual art more accurately.
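The hybrid strategy can be illustrated with a minimal sketch. It assumes each of the five cross-validation folds yields one emotion label per sample and that the explanation string comes from the LVLM. The function names, the tie-breaking rule, and the example labels are hypothetical and are not taken from the report.

```python
from collections import Counter


def hard_vote(fold_predictions):
    """Return the label predicted by the most folds.

    Ties go to the label that appears first in the input
    (an assumed rule; the report does not specify one).
    """
    if not fold_predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(fold_predictions)
    best = max(counts.values())
    for label in fold_predictions:
        if counts[label] == best:
            return label


def hybrid_output(fold_predictions, lvlm_explanation):
    """Pair the voted emotion label with the LVLM-generated explanation."""
    return {
        "emotion": hard_vote(fold_predictions),
        "explanation": lvlm_explanation,
    }


# Hypothetical labels from five fold models for one artwork dialog.
folds = ["awe", "fear", "awe", "awe", "sadness"]
result = hybrid_output(folds, "The vast sky makes the viewer feel small.")
```

Hard voting counts only the predicted labels, so it needs no calibrated probabilities from the individual fold models.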

Abstract

The Visual-Dialog Based Emotion Explanation Generation Challenge focuses on generating emotion explanations through visual-dialog interactions in art discussions. Our approach combines state-of-the-art multi-modal models, including a Language Model (LM) and a Large Vision Language Model (LVLM). By leveraging these models, we outperform existing benchmarks and secure the top rank in the ICCV23 Visual-Dialog Based Emotion Explanation Generation Challenge, which is part of the 5th Workshop On Closing The Loop Between Vision And Language (CLCV), with strong scores on the F1 and BLEU metrics. Our method generates accurate emotion explanations, advancing our understanding of emotional impacts in art.

Paper Structure

This paper contains 11 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: LVLM-based method architecture