ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter

Zhengqing Yuan, Yunhong He, Kun Wang, Yanfang Ye, Lichao Sun

TL;DR

ArtGPT-4 tackles artistic understanding in vision-language models by adding parameter-efficient adapters to an LLM instead of fine-tuning it in full. It uses two strategies: repurposing the LLM's self-attention for visual tokens (I-MHA) and adding Image Adapters to align image and text representations. Trained on roughly 0.52M image-text pairs in about 2 hours on a Tesla A100, it achieves state-of-the-art results on the ArtEmis, ArtEmis-v2.0, and ArtMM benchmarks, trailing professional artists' descriptions by 0.15 points on a 6-point scale. The work also provides a novel ArtMM evaluation suite and demonstrates strong performance across multiple backbones, highlighting practical efficiency for artistic AI applications.

Abstract

The success of large language models (LLMs) has inspired an emerging research field of multimodal learning. However, a grand challenge in exploiting LLMs for multimodal learning is the size of pre-trained LLMs, which typically have billions of parameters. To tackle this challenge, models such as MiniGPT-4 and LLaVA have been developed to fine-tune pre-trained models using fewer parameters. Despite their promising performance, these models remain limited in their understanding of artistic imagery. To facilitate better artistic understanding, we propose ArtGPT-4, a pioneering large vision-language model tailored to address the limitations of existing models in artistic comprehension. The key innovation of ArtGPT-4 lies in its design for the sophisticated challenge of artistic image comprehension, setting it apart from models that overlook fine details in favor of broader themes. Specifically, it integrates specialized adapter layers into the LLM, enabling the model to parse and interpret complex visual tokens more efficiently and effectively, instead of fine-tuning the whole LLM as existing methods do. ArtGPT-4 is efficient to train: on a Tesla A100 device, training completes in a mere 2 hours with an image-text pair dataset of approximately 0.52M entries. ArtGPT-4 also achieves state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets as well as on the benchmarks established in this work, lagging behind professional artists' descriptions by only 0.15 points on a 6-point scale. These results show that ArtGPT-4 can interpret images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. The code and the pre-trained model are available at https://github.com/DLYuanGod/ArtGPT-4.

Paper Structure (30 sections, 8 equations, 4 figures, 13 tables)

Figures (4)

  • Figure 1: Comparison between different structures of multimodal models. All of these methods are trained in a two-stage fashion. Stage 1 stands for pre-training and Stage 2 represents instruction tuning.
  • Figure 2: ArtGPT-4 exhibits a remarkable capacity for artistic understanding. It goes beyond capturing the artistic details of an image and extends to emotional understanding: ArtGPT-4 can discern and articulate the emotions an image elicits, such as feelings of positivity and inspiration, much as a human would.
  • Figure 3: We show how we adapt the LLM (c) of the multimodal model structure (b), using the adapter-based efficient fine-tuning method from NLP, to build ArtGPT-4 (e). During training, only the newly added Image Adapters (a) and some of the normalization layers (e) are updated, while all other layers are frozen.
  • Figure 4: A visual example from the ArtMM dataset, a compilation of various artworks and photographs. It covers a wide range of subjects, including wildlife, abstract and traditional art, landscapes, and imaginative fantasy scenes. The collection also incorporates elements from popular culture and cinematic stills.
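
The training setup described in the Figure 3 caption (only the added Image Adapters and some normalization layers are updated, everything else is frozen) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the residual bottleneck form, the zero-initialised up-projection, the layer names, the toy sizes, and the choice to unfreeze every normalization parameter are all assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class BottleneckAdapter:
    """Generic residual bottleneck adapter: x + W_up @ gelu(W_down @ x).

    Assumption: this is the common NLP adapter form, not necessarily
    the exact Image Adapter used in ArtGPT-4.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero-initialised up-projection (an assumption): the adapter
        # starts as the identity, so the frozen model is unchanged at step 0.
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        h = [gelu(v) for v in matvec(self.W_down, x)]
        delta = matvec(self.W_up, h)
        return [a + b for a, b in zip(x, delta)]

    def num_params(self):
        return (len(self.W_down) * len(self.W_down[0])
                + len(self.W_up) * len(self.W_up[0]))


def mark_trainable(param_counts, keywords=("adapter", "norm")):
    """Flag a parameter group as trainable if its name matches a keyword.

    Simplification: the source says only *part* of the normalization
    layers is updated; here every 'norm' group is unfrozen.
    """
    return {name: any(k in name for k in keywords) for name in param_counts}


dim, bottleneck = 8, 2  # toy sizes, chosen only for the example
adapter = BottleneckAdapter(dim, bottleneck)

# Hypothetical parameter groups for one transformer layer.
param_counts = {
    "layer0.attention.qkv": 3 * dim * dim,
    "layer0.mlp": 8 * dim * dim,
    "layer0.norm": dim,
    "layer0.image_adapter": adapter.num_params(),
}

flags = mark_trainable(param_counts)
trainable = sum(n for name, n in param_counts.items() if flags[name])
total = sum(param_counts.values())
print(f"trainable: {trainable} of {total} parameters")
```

With real model dimensions the frozen attention and MLP weights dominate the total even more strongly, which is the reason adapter-style training is cheap compared with full fine-tuning.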