Training A Small Emotional Vision Language Model for Visual Art Comprehension

Jing Zhang, Liang Zheng, Meng Wang, Dan Guo

TL;DR

This work tackles visual-art emotion understanding by predicting an emotion category and generating emotion-grounded explanations. It introduces SEVLM, a small vision-language model that incorporates Valence-Arousal-Dominance (VAD) emotion modeling into input embeddings and loss, plus a VAD head and a ternary contrastive head to align image, emotion, and explanation. Empirical results on ArtEmis v1.0 and v2.0 show SEVLM outperforms state-of-the-art small models and remains competitive with fine-tuned large models like LLaVA-FT and GPT4(V), all while running efficiently on a single RTX 2080 Ti. The work demonstrates that integrating expert psychological knowledge into compact models can yield subjective, human-aligned explanations with strong practical impact and broad applicability beyond artworks.

Abstract

This paper develops small vision language models for understanding visual art: given an artwork, the model aims to identify its emotion category and explain this prediction in natural language. While small models are computationally efficient, their capacity is much more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) through emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived from a VAD dictionary, and add a VAD head that aligns the VAD vectors of the predicted emotion explanation with those of the ground truth. This allows the vision language model to understand and generate emotional text better than it can with traditional text embeddings alone. On the other hand, we design a contrastive head that pulls together the embeddings of the image, its emotion class, and its explanation, which aligns model outputs with inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms state-of-the-art small models but is also competitive with fine-tuned LLaVA 7B and GPT4(V). The code is available at https://github.com/BetterZH/SEVLM-code.
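As a rough illustration of the VAD idea in the abstract, the sketch below looks up per-word VAD vectors in a dictionary and measures how far a predicted explanation is from the ground-truth explanation in VAD space. It is only a minimal sketch under stated assumptions: the lexicon `TOY_VAD` and its values are hypothetical, the neutral fallback for unknown words is an assumption, and the mean-squared distance is a stand-in, since the abstract does not state the exact form of the VAD head or its loss.

```python
from typing import Dict, List, Tuple

Vad = Tuple[float, float, float]  # (valence, arousal, dominance)

# Hypothetical toy lexicon with made-up values in [0, 1];
# a real VAD dictionary would be annotated by psychology experts.
TOY_VAD: Dict[str, Vad] = {
    "calm": (0.9, 0.1, 0.6),
    "storm": (0.2, 0.9, 0.7),
    "lonely": (0.1, 0.3, 0.2),
}
NEUTRAL: Vad = (0.5, 0.5, 0.5)  # assumed fallback for out-of-lexicon words


def vad_sequence(tokens: List[str]) -> List[Vad]:
    """Map each token to its VAD vector, using NEUTRAL for unknown words."""
    return [TOY_VAD.get(tok.lower(), NEUTRAL) for tok in tokens]


def mean_vad(tokens: List[str]) -> Vad:
    """Average the per-token VAD vectors into one sentence-level vector."""
    vecs = vad_sequence(tokens)
    if not vecs:
        return NEUTRAL
    n = len(vecs)
    return (
        sum(v[0] for v in vecs) / n,
        sum(v[1] for v in vecs) / n,
        sum(v[2] for v in vecs) / n,
    )


def vad_alignment_loss(pred_tokens: List[str], gt_tokens: List[str]) -> float:
    """Mean squared distance between predicted and ground-truth VAD vectors.

    Assumption: a simple stand-in for the alignment objective, not the
    paper's actual loss.
    """
    pred, gt = mean_vad(pred_tokens), mean_vad(gt_tokens)
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / 3.0


if __name__ == "__main__":
    predicted = "the storm feels lonely".split()
    reference = "the sea looks calm".split()
    print(vad_alignment_loss(predicted, reference))
```

An explanation whose emotional tone matches the reference yields a loss near zero, while one with the opposite tone is penalized even if its ordinary text embedding is similar.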

Paper Structure (17 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Examples comparing different methods of predicting the emotion class and explaining why this emotion is evoked by an art image, on both the ArtEmis v1.0 test set and the ArtEmis v2.0 Combined test set. Three models are compared: SAT (Achlioptas et al., 2021), NLX-GPT2 (Sammani et al., 2022), and our method. In both examples, the explanations from existing methods are misaligned with the emotion label or the art image, but our method gives superior results. Green text indicates incorrect emotion classification results; red text indicates large discrepancies between the semantics of the explanations and the visual content; blue text denotes that the emotion of the explanation does not correspond to the predicted category. Our design in the Proposed Improvements section aims to alleviate these problems.
  • Figure 2: Overview and comparison between the baseline model (a) and the proposed model (b). The gray boxes in (a) and (b) are the same: an art image, a prompt, and an explanation are used as input to GPT2 and then a language head, which outputs an emotion class and the corresponding explanation. The green boxes in (b) denote our technical contributions. We use a VAD dictionary to provide emotion features, which are complementary to the standard text embeddings. Moreover, we design a VAD head to facilitate emotion learning and a contrastive head to facilitate feature alignment among the image, emotion class, and explanation.
  • Figure 3: Depicting VAD vectors of example words in the 3-dim space.
  • Figure 4: Detailed network structure of the proposed small emotional vision language model. It has: 1) a vision language backbone including an image encoder, a small language model (SLM) GPT2 decoder, and a language head; 2) VAD emotion modeling, which introduces VAD emotion knowledge into the text embeddings to enhance the model's capacity to understand emotion; 3) a VAD head to learn VAD-aware emotion; and 4) a contrastive head to enforce feature alignment among the image, emotion label, and explanation. During training, we use the emotion label and explanation as ground truth. In inference, we use the prompt 'The emotion is _' and an art image as input and generate the emotion label and explanation.
  • Figure 5: Ablation study of the three components on the ArtEmis v1.0 test set (a) and the ArtEmis v2.0 combined test set (b). We also perform statistical tests on the B4 and M metrics and report p-values, which indicate whether the difference in performance between two methods is significant. 'n.s.' means the difference is not statistically significant (i.e., p-value > 0.05). $\ast$ denotes statistically significant (i.e., 0.01 < p-value < 0.05). $\ast\ast$ and $\ast\ast\ast$ mean statistically very significant (i.e., 0.001 < p-value < 0.01) and statistically extremely significant (i.e., p-value < 0.001), respectively.
  • ...and 5 more figures
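
The significance markers described in the Figure 5 caption can be sketched as a small lookup. This is only an illustrative helper: the function name `significance_marker` is hypothetical, and because the caption does not say how p-values exactly equal to 0.05, 0.01, or 0.001 are treated, assigning each boundary value to the less significant bucket is an assumption.

```python
def significance_marker(p_value: float) -> str:
    """Map a p-value to the marker convention used in the Figure 5 caption.

    Assumption: boundary values (exactly 0.05, 0.01, 0.001) fall into the
    less significant bucket, since the caption leaves them unspecified.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value >= 0.05:
        return "n.s."  # not statistically significant
    if p_value >= 0.01:
        return "*"     # statistically significant
    if p_value >= 0.001:
        return "**"    # statistically very significant
    return "***"       # statistically extremely significant


if __name__ == "__main__":
    for p in (0.2, 0.03, 0.004, 0.0002):
        print(p, significance_marker(p))
```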