Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols

Sebastian Padó, Kerstin Thomas

TL;DR

The paper investigates how current vision-language models (VLMs) interpret emotions and emotion symbols in artworks. In a case study with 38 artworks and three open-weight VLMs, it prompts the models with eight questions, and experts evaluate the outputs qualitatively. The models recognize content reliably and emotions in part for concrete works, but they struggle with abstract or symbolic imagery and with symbol interpretation, and they show notable cross-model inconsistency. The study shows the potential of VLMs for scalable, humanities-informed readings of artworks while underscoring the need for expert oversight and further model fine-tuning.

Abstract

Emotions are a fundamental aspect of artistic expression. Because emotions are abstract, artworks realize them in a broad spectrum of ways; these realizations are subject to historical change, and analyzing them requires expertise in art history. In this article, we investigate which aspects of emotional expression current (2025) vision language models (VLMs) can detect. We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask the models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well, and often also which emotions the images depict and how these emotions are expressed. The models perform best on concrete images but fail on highly abstract or highly symbolic ones. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of giving inconsistent answers to related questions.
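To make the scale of the expert evaluation concrete, the following minimal sketch enumerates the answers that would need review, assuming every model answers all eight questions for every one of the 38 artworks (the TL;DR gives these counts but not the exact protocol). The labels for the two Qwen variants are hypothetical placeholders, the picture numbering 1 to 38 is an assumption based on the Appendix A picture numbers cited in Figure 1, and because the assignment of the eight questions to the four question sets is not stated, questions are indexed only by number.

```python
from itertools import product

# "Llava-Llama" is named in the abstract; the two Qwen labels are
# hypothetical placeholders, since the specific Qwen variants are not named.
MODELS = ["Llava-Llama", "Qwen (variant A)", "Qwen (variant B)"]

N_ARTWORKS = 38   # from the TL;DR; assumed numbered 1..38 as in Appendix A
N_QUESTIONS = 8   # from the TL;DR; mapping of questions to the four sets not given


def evaluation_grid():
    """Return one (model, picture, question) triple per answer to review,
    assuming every model answers every question for every artwork."""
    pictures = range(1, N_ARTWORKS + 1)
    questions = range(1, N_QUESTIONS + 1)
    return list(product(MODELS, pictures, questions))


if __name__ == "__main__":
    grid = evaluation_grid()
    print(len(grid))  # 3 * 38 * 8 = 912 answers for expert review
```

Under these assumptions the grid has 3 × 38 × 8 = 912 entries, each an answer that the experts assess qualitatively rather than with an automatic metric.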

Paper Structure

This paper contains 18 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Examples of three difficult artworks. Left: The blast (Corot, Picture 15 in Appendix A). Center: Medea furious (Delacroix, Picture 21). Right: Melancholia I (Dürer, Picture 27).