VAEER: Visual Attention-Inspired Emotion Elicitation Reasoning
Fanhang Man, Xiaoyue Chen, Huandong Wang, Baining Zhao, Han Li, Xinlei Chen
TL;DR
VAEER addresses multi-label visual emotion elicitation in three steps: it decomposes an image into visual focus, interactions, and context; grounds those cues in a structured affective knowledge graph; and performs arousal reasoning separately for each emotion. The three-stage framework (VAM, MME-RAG, PAR) achieves state-of-the-art results on three benchmarks, with per-emotion improvements of up to 19% and an average F1 gain of 12.3% over strong baselines. Qualitative analyses show interpretable intermediate reasoning and plausible alignment with human judgments. The work emphasizes cognitive grounding and interpretability as essential for scalable, responsible emotion-aware visual analysis in online and crisis contexts.
Abstract
Images shared online strongly influence emotions and public well-being. Understanding the emotions an image elicits is therefore vital for fostering healthier and more sustainable digital communities, especially during public crises. We study Visual Emotion Elicitation (VEE), the task of predicting the set of emotions that an image evokes in viewers. We introduce VAEER, an interpretable multi-label VEE framework that combines attention-inspired cue extraction with knowledge-grounded reasoning. VAEER isolates salient visual foci and contextual signals, aligns them with structured affective knowledge, and performs per-emotion inference to yield transparent, emotion-specific rationales. Across three heterogeneous benchmarks, including social imagery and disaster-related photos, VAEER achieves state-of-the-art results, with per-emotion improvements of up to 19% and a 12.3% average gain over strong CNN and VLM baselines. Our findings highlight interpretable multi-label emotion elicitation as a scalable foundation for responsible visual media analysis and emotionally sustainable online ecosystems.
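To make the reported metrics concrete, the following is a minimal sketch of how per-emotion F1 and an average F1 can be computed for multi-label predictions, where each image carries a set of elicited emotions. It is an illustration only: the emotion names, the toy data, and the choice of an unweighted (macro) average are assumptions made here, not details taken from the paper's evaluation protocol.

```python
from typing import Dict, List, Set

# Hypothetical label set; the benchmarks' actual emotion categories may differ.
EMOTIONS = ["joy", "fear", "sadness"]


def per_emotion_f1(
    gold: List[Set[str]], pred: List[Set[str]], emotions: List[str]
) -> Dict[str, float]:
    """Treat each emotion as its own binary task and compute F1 per emotion."""
    scores: Dict[str, float] = {}
    for emotion in emotions:
        tp = sum(1 for g, p in zip(gold, pred) if emotion in g and emotion in p)
        fp = sum(1 for g, p in zip(gold, pred) if emotion not in g and emotion in p)
        fn = sum(1 for g, p in zip(gold, pred) if emotion in g and emotion not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the emotion never occurs.
        scores[emotion] = 2 * tp / denom if denom else 0.0
    return scores


def average_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean over emotions (macro-averaging is an assumption here)."""
    return sum(scores.values()) / len(scores)


# Toy example: three images, each with a set of gold and predicted emotions.
gold = [{"joy"}, {"fear", "sadness"}, {"joy", "sadness"}]
pred = [{"joy"}, {"fear"}, {"sadness"}]

scores = per_emotion_f1(gold, pred, EMOTIONS)
print(scores)
print(average_f1(scores))
```

Scoring each emotion independently mirrors the per-emotion inference described above: an image can elicit several emotions at once, so a miss on one label does not affect the score of another.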
