Learning to Evaluate the Artness of AI-generated Images

Junyu Chen, Jie An, Hanjia Lyu, Christopher Kanan, Jiebo Luo

TL;DR

This work introduces ArtScore, a reference-free, instance-level metric for evaluating the artness of AI-generated images. It constructs a pseudo-annotated dataset by transferring photorealistic StyleGAN2 models to artistic styles and generating interpolations between them, with artness controlled by the interpolation weight $\alpha$, then trains a neural network with a learn-to-rank objective (ListMLE) to predict relative artness. Empirical results show ArtScore aligns more closely with human artistic judgments than Gram Loss or ArtFID and improves artness ranking when combined with other metrics, validating its usefulness for evaluating and guiding art-focused image generation. The framework offers a scalable objective tool for researchers and artists to quantify and compare AI-generated art qualities, with potential integration into model training and sampling pipelines.
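The learn-to-rank objective named above, ListMLE, can be illustrated with a minimal sketch. It computes the negative Plackett-Luce log-likelihood of a target ordering given predicted scores. This is not the authors' implementation: the function name `list_mle_loss`, the example scores, and the convention that a larger $\alpha$ means a more artistic image are all assumptions made for illustration.

```python
import math


def list_mle_loss(scores, order):
    """Negative Plackett-Luce log-likelihood of `order` under `scores`.

    scores: predicted artness score for each image in one interpolation sequence.
    order:  image indices sorted from highest to lowest target artness.
    """
    ranked = [scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        # Numerically stable log-sum-exp over the items not yet placed.
        m = max(tail)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_sum_exp - ranked[i]
    return loss


# Hypothetical sequence of five images generated at increasing alpha.
# Assumption: larger alpha is treated as "more artistic".
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
target_order = sorted(range(len(alphas)), key=lambda i: -alphas[i])

good_scores = [0.1, 0.8, 1.5, 2.3, 3.0]   # increases with alpha
bad_scores = list(reversed(good_scores))  # decreases with alpha

loss_good = list_mle_loss(good_scores, target_order)
loss_bad = list_mle_loss(bad_scores, target_order)
```

Scores that agree with the pseudo-annotated ordering give a lower loss than scores that reverse it, which is the signal a ranking model would be trained on. Only the relative order of the pseudo-labels enters the loss, not the $\alpha$ values themselves.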

Abstract

Assessing the artness of AI-generated images continues to be a challenge within the realm of image generation. Most existing metrics cannot be used to perform instance-level and reference-free artness evaluation. This paper presents ArtScore, a metric designed to evaluate the degree to which an image resembles authentic artworks by artists (or conversely photographs), thereby offering a novel approach to artness assessment. We first blend pre-trained models for photo and artwork generation, resulting in a series of mixed models. Subsequently, we utilize these mixed models to generate images exhibiting varying degrees of artness with pseudo-annotations. Each photorealistic image has a corresponding artistic counterpart and a series of interpolated images that range from realistic to artistic. This dataset is then employed to train a neural network that learns to estimate quantized artness levels of arbitrary images. Extensive experiments reveal that the artness levels predicted by ArtScore align more closely with human artistic evaluation than existing evaluation metrics, such as Gram loss and ArtFID.

Paper Structure (17 sections, 6 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The proposed framework consists of three steps: 1) StyleGAN transfer, 2) interpolated image generation, and 3) ArtScore training. The first two steps generate a large dataset containing images with varying levels of artness. The third step learns to distinguish different levels of artness from the developed dataset.
  • Figure 2: We use (a) FreezeG and (b) a few-shot adaptation method to adapt the photorealistic StyleGAN to artistic styles. In the FreezeG setting (a), we fine-tune the photorealistic model using the entire art painting dataset while freezing the early convolution layers. In the few-shot adaptation setting (b), we cluster the art paintings by their styles and sample representative paintings to fine-tune the entire photorealistic model using only a few representative examples.
  • Figure 3: StyleGAN model interpolation and interpolated image generation. We linearly interpolate the low-level convolution layers of photorealistic and artistic StyleGAN models. Real photos and artworks are projected into the latent space of respective models to retrieve latent codes. These codes are used as the input of the interpolated models to generate sequences of images with varied levels of artness.
  • Figure 4: Effect of fusing all blocks (top) or the last blocks (bottom) of the photorealistic and artistic models. Note that the structure of the horse is better preserved if we only fuse the low-level layers.
  • Figure 5: The correlation between the user study rankings from \cite{wright2022artfid} and rankings induced by content preservation and style matching metrics, both with and without our proposed ArtScore metric. We tested three aggregation methods denoted by Eq. \ref{equ:rank}, Eq. \ref{equ:add}, and Eq. \ref{equ:mul}. Lighter shades represent results without ArtScore, while darker shades represent results with ArtScore. In general, incorporating ArtScore consistently improves the accuracy of image artness ranking across all aggregation methods.
  • ...and 1 more figures
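
The model interpolation described in the captions of Figures 3 and 4 can be sketched as a per-layer linear blend of two sets of weights. This is a toy illustration, not the authors' code: weights are plain Python lists rather than StyleGAN tensors, the names `interpolate_weights`, `block0`, and `block7` are hypothetical, and two conventions are assumed here: $\alpha = 0$ gives the photorealistic model and $\alpha = 1$ the artistic one, and layers that are not blended are copied from the photorealistic model.

```python
def interpolate_weights(photo, art, alpha, blend_keys):
    """Linearly blend selected layers of two models with identical layouts.

    photo, art: dicts mapping layer name -> flat list of weights.
    alpha:      interpolation weight (assumed: 0 = photo, 1 = art).
    blend_keys: names of the layers to blend; all other layers are
                copied from the photorealistic model (an assumption).
    """
    mixed = {}
    for name, w_photo in photo.items():
        if name in blend_keys:
            w_art = art[name]
            mixed[name] = [(1 - alpha) * p + alpha * a
                           for p, a in zip(w_photo, w_art)]
        else:
            mixed[name] = list(w_photo)
    return mixed


# Toy two-layer "models"; block7 stands in for a late, low-level block.
photo = {"block0": [1.0, 2.0], "block7": [0.0, 4.0]}
art = {"block0": [9.0, 9.0], "block7": [2.0, 0.0]}

# One mixed model per artness level.
sequence = [interpolate_weights(photo, art, a, {"block7"})
            for a in (0.0, 0.5, 1.0)]
```

Restricting `blend_keys` to the late blocks mirrors the observation in Figure 4 that structure is better preserved when only the low-level layers are fused. Feeding the same latent code to each mixed model would then yield one image sequence with graded artness.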