Towards Artwork Explanation in Large-scale Vision Language Models
Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
TL;DR
This paper introduces an artwork explanation generation task for large-scale vision-language models (LVLMs), together with a Wikipedia-derived dataset, to quantify how well these models integrate knowledge into their explanations. It defines three metrics, $EC$, $EF_1$, and $ECooc$, which measure entity coverage, factual precision and recall, and inter-entity coherence in generated explanations, respectively. Experiments with multiple LVLMs and instruction-tuned baselines show that the base language models retain artwork knowledge, but vision-language integration often misaligns this knowledge with the visual input, especially when explanations must be produced from images alone. The work also highlights data-source biases and gaps in human evaluation, and proposes directions such as external knowledge sources and retrieval-augmented generation to improve knowledge-grounded explanations in LVLMs.
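To make the three metrics concrete, the following is a minimal Python sketch of entity-level scores in the spirit of $EC$, $EF_1$, and $ECooc$. The paper's exact definitions are not given in this summary, so everything here is an assumption: the function names, the case-insensitive substring matching, the set-based F1, and the pairwise co-occurrence ratio are simplified stand-ins, and the example entities and text are invented for illustration only.

```python
from itertools import combinations


def entity_coverage(reference_entities, generated_text):
    """Fraction of reference entities that appear verbatim in the text.

    Assumption: case-insensitive substring matching stands in for
    whatever matching rule the paper actually uses.
    """
    if not reference_entities:
        return 0.0
    text = generated_text.lower()
    hits = sum(1 for entity in reference_entities if entity.lower() in text)
    return hits / len(reference_entities)


def entity_f1(reference_entities, generated_entities):
    """Set-based F1 between reference and generated entity mentions.

    Assumption: entities have already been extracted from the generated
    explanation by some upstream step not shown here.
    """
    ref = {e.lower() for e in reference_entities}
    gen = {e.lower() for e in generated_entities}
    overlap = len(ref & gen)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def entity_cooccurrence(reference_entities, generated_text):
    """Fraction of reference entity pairs whose members both appear in the text.

    Assumption: a crude proxy for inter-entity coherence; the paper's
    ECooc may weight or window co-occurrences differently.
    """
    text = generated_text.lower()
    unique = sorted({e.lower() for e in reference_entities})
    pairs = list(combinations(unique, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if a in text and b in text)
    return hits / len(pairs)


if __name__ == "__main__":
    # Invented example data, not taken from the paper's dataset.
    reference = ["Mona Lisa", "Leonardo da Vinci", "Louvre"]
    explanation = "The Mona Lisa is a portrait by Leonardo da Vinci."
    extracted = ["Mona Lisa", "Leonardo da Vinci", "Florence"]

    print("EC   :", entity_coverage(reference, explanation))
    print("EF1  :", entity_f1(reference, extracted))
    print("ECooc:", entity_cooccurrence(reference, explanation))
```

Under these assumptions, all three scores lie in [0, 1], and an explanation that names the right entities in isolation can still score low on the co-occurrence measure if it fails to mention related entities together.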
Abstract
Large-scale Vision-Language Models (LVLMs) output text from images and instructions, demonstrating capabilities in text generation and comprehension. However, it has not been clarified to what extent LVLMs possess the ability to understand the knowledge necessary for explaining images, the complex relationships between various pieces of knowledge, and how they integrate these understandings into their explanations. To address this issue, we propose a new task: the artwork explanation generation task, along with its evaluation dataset and metrics for quantitatively assessing the understanding and utilization of knowledge about artworks. This task is apt for image description based on the premise that LVLMs are expected to have pre-existing knowledge of artworks, which are often subjects of wide recognition and documented information. It consists of two parts: generating explanations from images and titles of artworks, and generating explanations using only images, thus evaluating the LVLMs' language-based and vision-based knowledge. We also release a training dataset for LVLMs to learn explanations that incorporate knowledge about artworks. Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone.
