GalleryGPT: Analyzing Paintings with Large Multimodal Models
Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen
TL;DR
This work tackles a gap in AI art analysis by shifting from recognition or knowledge retrieval to visual-centric formal analysis of paintings. It introduces the PaintingForm dataset (approximately 19,000 images and 50,000 analyses) and GalleryGPT, a model built on a ShareGPT4V-7B/LLaVA backbone and fine-tuned to emphasize visual features. Through supervised fine-tuning on PaintingForm, GalleryGPT achieves state-of-the-art performance on formal analysis generation and on zero-shot downstream art tasks across AQUA, ArtQuest, and ArtBench, while maintaining strong general multimodal capabilities. The study demonstrates the potential of large multimodal models for professional art analysis and highlights future directions, including broader art forms and interactive art-assistance tools.
Abstract
Artwork analysis is an important and fundamental skill for art appreciation, one that can enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging because of their subjective nature, diverse interpretations, and complex visual elements, and it requires expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model capability, previous work on automatic artwork analysis has focused mainly on classification, retrieval, and other simple tasks, which remain far from the goal of AI. To advance research in this direction, we go a step further and compose comprehensive analyses, inspired by the remarkable perception and generation abilities of large multimodal models (LMMs). Specifically, we first propose the task of composing paragraph-level analyses of artworks (paintings, in this paper) that focus only on visual characteristics, in order to form a more comprehensive understanding of the artworks. To support research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a large multimodal model for composing painting analyses, dubbed GalleryGPT, which is slightly modified from the LLaVA architecture and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating the model's strong ability in art analysis and generalization. The code and model are available at: https://github.com/steven640pixel/GalleryGPT.
