GalleryGPT: Analyzing Paintings with Large Multimodal Models

Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen

TL;DR

This work tackles the gap in AI art analysis by shifting from recognition or knowledge retrieval to visual-centric formal analysis of paintings. It introduces the PaintingForm dataset (approximately 19,000 images and 50,000 analyses) and a fine-tuned GalleryGPT based on a ShareGPT4V-7B/LLaVA backbone that emphasizes visual features. Through supervised fine-tuning on PaintingForm, GalleryGPT achieves state-of-the-art performance on formal analysis generation and zero-shot downstream art tasks across AQUA, ArtQuest, and ArtBench, while maintaining strong multimodal capabilities. The study demonstrates the potential of using large multimodal models for professional art analysis and highlights future directions for broader art forms and interactive art-assistance tools.

Abstract

Artwork analysis is an important and fundamental skill for art appreciation, which can enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging due to their subjective nature, diverse interpretations, and complex visual elements, and it requires expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model ability, previous work on automatically analyzing artworks has mainly focused on classification, retrieval, and other simple tasks, which is far from the goal of AI. To facilitate research progress, in this paper we go a step further and compose comprehensive analyses, inspired by the remarkable perception and generation abilities of large multimodal models. Specifically, we first propose the task of composing paragraph-level analysis for artworks (paintings, in this paper), focusing only on visual characteristics to form a more comprehensive understanding of artworks. To support research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a superior large multimodal model for composing painting analysis, dubbed GalleryGPT, which is slightly modified from the LLaVA architecture and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements compared with powerful baseline LMMs, demonstrating its strong ability in art analysis and generalization. The code and model are available at: https://github.com/steven640pixel/GalleryGPT.

Paper Structure (23 sections, 12 figures, 3 tables)

This paper contains 23 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: An example of existing LMMs for art analysis. The painting in the top-left corner (a nameless painting, chosen to avoid knowledge memorized in LLMs) is the target image to be analyzed, and the bottom one, "The Astronomer" by Johannes Vermeer, is the painting GPT-4V wrongly recognized it as. The red text indicates wrong analysis, mainly caused by wrong recognition by Gemini and GPT-4V. The green text marks the good parts of the "formal analysis". The purple text denotes factual description of the painting content, which is closer to the image captioning task.
  • Figure 2: The overall pipeline for constructing our PaintingForm collection. Our annotation process depends only on the language model, without vision information. The prompt illustrated here is just a simplified version for exhibition; the actual prompts were elaborately designed.
  • Figure 3: A statistical illustration of the distribution of paintings by each artist in Most Popular Artists. Valid paintings denote the paintings retained after the filtering rules in Section \ref{sec:img_source}, and invalid paintings are the discarded ones. We finally filter out 769 and retain 18,026 paintings for this part (not including the 500 most popular paintings). We do not include the 500 paintings in the illustration because: 1) all 500 popular paintings are retained, and 2) about 150 artists have only one painting in this gallery, resulting in a severe long-tail distribution.
  • Figure 4: Statistics of elements of formal analysis. Only very few paintings (8 in total) contain the elements Scale and Proportion.
  • Figure 5: An example for qualitative comparison of formal analysis generation by several powerful LMMs. Purple text denotes factual content description, and blue text denotes formal analysis. The formal analysis generated by our GalleryGPT covers more visual elements, e.g., color, light and shadow, depth, composition, and perspective, than other LMMs, even the powerful GPT-4V.
  • ...and 7 more figures