Towards Artwork Explanation in Large-scale Vision Language Models

Kazuki Hayashi; Yusuke Sakai; Hidetaka Kamigaito; Katsuhiko Hayashi; Taro Watanabe

TL;DR

This paper tackles artwork explanation generation for large-scale vision-language models (LVLMs) by introducing a dedicated task and a Wikipedia-derived dataset to quantify knowledge integration. It defines three metrics, $EC$, $EF_1$, and $ECooc$, which measure entity coverage, factual precision and recall, and inter-entity coherence in generated explanations. Through experiments with multiple LVLMs and instruction-tuned baselines, the authors show that base language models retain artwork knowledge, but vision-language integration often misaligns this knowledge with the visual input, especially when explanations must be produced from images alone. The work highlights data-source biases and gaps in human evaluation, and proposes directions such as external knowledge sources and retrieval-augmented generation to improve knowledge-grounded explanations in LVLMs.
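To make the entity-level metrics named above more concrete, here is a minimal sketch of how an entity coverage score and an entity F1 score could be computed. It is an illustration under stated assumptions, not the paper's definition: the function names `entity_coverage` and `entity_f1`, the case-insensitive exact-match rule, and the example entity lists are all hypothetical, and the paper's actual matching and weighting may differ.

```python
def entity_coverage(generated_text: str, reference_entities: list[str]) -> float:
    """Fraction of distinct reference entities found verbatim in the text.

    Assumption: case-insensitive substring matching stands in for
    whatever matching rule the paper actually uses.
    """
    unique_refs = {e.lower() for e in reference_entities}
    if not unique_refs:
        return 0.0
    text = generated_text.lower()
    hits = sum(1 for entity in unique_refs if entity in text)
    return hits / len(unique_refs)


def entity_f1(generated_entities: list[str], reference_entities: list[str]) -> float:
    """Harmonic mean of entity-level precision and recall.

    Assumption: entities are compared as lowercased exact strings.
    """
    gen = {e.lower() for e in generated_entities}
    ref = {e.lower() for e in reference_entities}
    true_positives = len(gen & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(gen)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data, chosen only for illustration.
reference = [
    "Vincent van Gogh",
    "The Starry Night",
    "Museum of Modern Art",
    "Post-Impressionism",
]
explanation = "The Starry Night was painted by Vincent van Gogh in 1889."
extracted = ["Vincent van Gogh", "The Starry Night", "Paris"]

coverage = entity_coverage(explanation, reference)
f1 = entity_f1(extracted, reference)
```

Coverage rewards mentioning the reference entities at all, while F1 also penalizes entities in the explanation that the reference does not contain, which is why the two scores can diverge for long, speculative outputs.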

Abstract

Large-scale Vision-Language Models (LVLMs) output text from images and instructions, demonstrating capabilities in text generation and comprehension. However, it has not been clarified to what extent LVLMs possess the ability to understand the knowledge necessary for explaining images, the complex relationships between various pieces of knowledge, and how they integrate these understandings into their explanations. To address this issue, we propose a new task: the artwork explanation generation task, along with its evaluation dataset and metrics for quantitatively assessing the understanding and utilization of knowledge about artworks. This task is apt for image description based on the premise that LVLMs are expected to have pre-existing knowledge of artworks, which are often subjects of wide recognition and documented information. It consists of two parts: generating explanations from images and titles of artworks, and generating explanations using only images, thus evaluating the LVLMs' language-based and vision-based knowledge. Alongside, we release a training dataset for LVLMs to learn explanations that incorporate knowledge about artworks. Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone.

Paper Structure (48 sections, 4 equations, 8 figures, 18 tables)

Introduction
LVLMs
Task and Evaluation Metrics
Task
With Title
Without Title
Evaluation Metrics
Entity Coverage
Entity F1
Entity Cooccurrence
Dataset Creation
STEP 1:
STEP 2:
STEP 3:
STEP 4:
...and 33 more sections

Figures (8)

Figure 1: An example of creative assistance using an LVLM, harnessing comprehensive artistic knowledge for guidance.
Figure 2: Workflow diagram illustrating the methodology for dataset creation from Wikipedia articles on artworks, involving selection, filtering, data balancing, and instructional templating for LVLM training and evaluation.
Figure 3: Average token lengths for 18 evaluated LVLMs on an unseen set, where yellow represents the 'With Title' setting, blue indicates the 'Without Title' setting, and red signifies the average token length for the base language model of the LVLM with titles. The length of the unseen reference sentence is 174 tokens.
Figure 4: Average token lengths for Qwen's Few-shot and Fine-tuning settings on an unseen set, where yellow represents the 'With Title' setting, blue indicates the 'Without Title' setting, and red signifies the average token length for the base language model of the LVLM with titles. The length of the unseen reference sentence is 174 tokens.
Figure 5: Train set format with title.
...and 3 more figures
