Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

Jin Kim, Byunghwee Lee, Taekho You, Jinhyuk Yun

TL;DR

This study uses cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings and reveals that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements.

Abstract

The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: Understanding the Evolution of Western Paintings with Image Embedding. (A) The evolution of paintings has paralleled human evolution through historical events, technological advancements, and cultural developments, yet contextual information is commonly disregarded in data-scientific approaches to art history. (B–G) A two-dimensional (2D) projection of 72,447 Western paintings was obtained by applying Uniform Manifold Approximation and Projection (UMAP) [mcinnes2018umap] to the encoded vectors, where each dot represents a painting. To emphasize the importance of contextual content in Western paintings, we encoded the paintings using (B, D, F) an autoencoder (A-vector) [kingma2019introduction] and (C, E, G) CLIP (C-vector) [radford2021learning]. Dot colors indicate painting years. A-vectors show a mixed distribution in which painting years are difficult to distinguish (B), while C-vectors effectively differentiate them (C). For instance, red-bordered paintings sampled from different periods are clustered in the center of the A-vector space but are well separated in the C-vector space. This pattern holds for (D and E) the 10 most frequent style periods and (F and G) 10 seminal artists. The year displayed next to each artist name is the artist's death year, whereas the year next to each style period is its most frequent painting year. (H and I) To test the expressibility of each vector, we trained regression models using XGBoost [chen2016xgboost] to predict painting years from the embedded vectors (see Materials and Methods). Here, solid lines are drawn with locally weighted regression [cleveland1979robust]. A-vectors demonstrate limited predictability ($R^2=0.203$, Pearson $\rho=0.451$), while C-vectors show remarkably high predictability ($R^2=0.866$, Pearson $\rho=0.931$).
  • Figure 2: Principal Component Analysis (PCA) Reveals Latent Information Encoded in Embedded Vectors. To explore the expressibility gap between A- and C-vectors, we conducted PCA to extract principal information from each vector representation. The first two components of each vector space revealed that (A) A-vectors show minimal variation, while (B) C-vectors demonstrate substantial temporal differentiation. For (A) and (B), the x-axis is bounded by the maximum and minimum projected values of paintings on the PC. (C) We obtained modified A-vectors of Munch's "The Scream" using vector analogy: $\mathbf{v}_{\text{new}} = \mathbf{v}_{\text{original}} + d \cdot \mathbf{PC}_{i}$, where $\mathbf{v}_{\text{original}}$ represents the embedded vector of "The Scream" and $\mathbf{PC}_{i}$ denotes the normalized $i$-th PC vector. Images were then generated by decoding these modified vectors with the SDM's autoencoder [rombach2022high]. The resulting images show that the first four PCs of the A-vector primarily represent visual composition elements: 1) brightness, 2) vertical brightness composition, 3) hue (blue to yellow), and 4) highlight distribution. The original vector of "The Scream" is placed near the zero point, while Vermeer's "Girl with a Pearl Earring" has a high PC1 value. (D) For comparison, because CLIP's encoder-only architecture lacks image generation ability, we retrieved paintings based on their C-vectors' projected values and distances on each PC (see Materials and Methods for detailed selection criteria). We observed that images mainly vary in context along the PCs, as exemplified by the transition from portraits to abstract compositions along PC1.
  • Figure 3: Generative Prompt Outlines Temporal Evolution of Western Art. (A) Religious words show decreasing trends over time. (B) Similarly, human subject descriptors gradually decrease. (C) Comparative analysis of the artistic style descriptors "abstract" and "portrait", highlighting the transition in representational approaches. (D) Natural features abruptly increase around the 1800s, likely reflecting the development of tube colors and of steam locomotives, which enhanced the mobility of painters. (E) We also observed notable changes in transportation-related keywords. For example, the term "train" increases steeply following the invention of steam locomotives in the 1800s. (F) The rise of simple color words indicates an evolution towards an abstract style in the early 1900s. These findings highlight how multimodal AI models can effectively illustrate shifts in artistic expression and content in response to technological, societal, and stylistic changes throughout history.
  • Figure 4: Replicating Evolutionary Trajectories of Western Paintings. To verify our findings on the role of context in understanding the evolution of art, we designed a simple generative experiment with an image-to-image diffusion model using two kinds of prompt guidance: one used a null prompt (random-diffusion), while the other used representative keywords of the next century, e.g., keywords of the 1800s for the paintings of the 1700s (future-directed; see Materials and Methods). (A–D) Using the generated images, we first estimated the painting year of each with the C-vector regression model shown in Fig. 1. The future-directed images (red lines) tend to be predicted consistently $\simeq 100$ years ahead of their source paintings, demonstrating a systematic temporal shift in artistic characteristics. In contrast, the random-diffused images (blue lines) show a distinctive convergence towards ${\sim}1900$ as the diffusion steps increase. Here, solid lines are drawn with locally weighted regression [cleveland1979robust], and steps control the relative noise level: larger step values apply stronger noise and thus reduce consistency with the original image (see Materials and Methods). (E) Sampled images for each period, showing that diffusion steps do not alter the large formal elements of paintings for either random-diffused (blue-boxed) or future-directed (red-boxed) images. (F) Spectrum of the paintings. The color represents the original period of the painting, and the position reflects the projected value of its C-vector onto the temporal axis $\mathbf{v}_{1500s \rightarrow 1900s}$ for original (black-boxed), random-diffused (blue-boxed), and future-directed (red-boxed) images at step 1 (see Materials and Methods). For all panels, we randomly sampled 500 images from each period to reduce bias from data imbalance.
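
The vector analogy in Figure 2C and the temporal-axis projection in Figure 4F both reduce to simple vector arithmetic. The following is a minimal sketch, not the authors' code. It uses plain Python lists as stand-ins for embeddings, and all function names are hypothetical. Defining the temporal axis as the normalized difference between the mean embeddings of two periods is an assumption made here for illustration; the paper's exact construction is given only in its Materials and Methods.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def shift_along_direction(v_original, direction, d):
    """Vector analogy: v_new = v_original + d * PC_i.

    `direction` plays the role of the i-th principal component; it is
    normalized here, matching the caption's "normalized i-th PC vector".
    """
    unit = normalize(direction)
    return [x + d * u for x, u in zip(v_original, unit)]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def temporal_axis(early_vectors, late_vectors):
    """ASSUMPTION: unit vector from the early-period centroid to the
    late-period centroid (e.g. 1500s -> 1900s)."""
    early = centroid(early_vectors)
    late = centroid(late_vectors)
    return normalize([b - a for a, b in zip(early, late)])


def project(v, axis):
    """Scalar projection of v onto a unit-length axis."""
    return sum(x * a for x, a in zip(v, axis))


if __name__ == "__main__":
    # Toy 3-D "embedding" shifted by d = 5 along a made-up direction.
    print(shift_along_direction([1.0, 2.0, 3.0], [0.0, 3.0, 4.0], 5.0))

    # Toy 2-D period embeddings; larger projections lie closer to the late period.
    axis = temporal_axis([[0.0, 0.0], [0.0, 2.0]], [[4.0, 1.0], [4.0, 1.0]])
    print(project([3.0, 9.0], axis))
```

With real data, `direction` would be a principal component of the A-vectors and the shifted vector would be passed to the autoencoder's decoder, while `project` would be applied to the C-vectors of original and generated images to place them on the spectrum.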