Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability

Jaehui Hwang; Junghyuk Lee; Jong-Seok Lee

TL;DR

The paper addresses the problem of evaluating generated-image quality with something better than feature-distance metrics, which align poorly with human perception. It introduces two metrics, the anomaly score (AS) for generative models and the anomaly score for individual images (AS-i), both grounded in two properties of the representation space around an image: complexity ($C$) and vulnerability ($V$). AS is a 2D Kolmogorov–Smirnov distance between the joint distributions of $(C, V)$ for real and generated data. Empirical results show that AS and AS-i correlate more strongly with human judgments than prior metrics such as FID across multiple datasets and feature models, and that AS-i tracks the per-image naturalness reported by human raters. The work provides a practical, human-aligned framework for evaluating both the overall output of a generative model and individual generated images, with implications for model development and benchmarking in synthetic image generation.
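
To make the idea of a 2D Kolmogorov–Smirnov distance concrete, here is a minimal sketch, not the authors' implementation. It assumes a quadrant-counting statistic in the style of Peacock and Fasano-Franceschini; this summary does not say which variant the paper uses. The function names (`quadrant_fractions`, `ks_2d`) and the synthetic (C, V) samples are hypothetical and serve only as an illustration.

```python
import numpy as np


def quadrant_fractions(points, origin):
    """Fraction of `points` falling in each of the four quadrants around `origin`."""
    right = points[:, 0] > origin[0]
    up = points[:, 1] > origin[1]
    return np.array([
        np.mean(right & up),
        np.mean(~right & up),
        np.mean(~right & ~up),
        np.mean(right & ~up),
    ])


def ks_2d(sample_a, sample_b):
    """Largest quadrant-fraction gap between two 2D samples (assumed variant).

    Every point of either sample is tried as the quadrant origin.
    The result lies in [0, 1]; 0 means no gap was found.
    """
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    best = 0.0
    for origin in np.vstack([sample_a, sample_b]):
        gap = np.abs(quadrant_fractions(sample_a, origin)
                     - quadrant_fractions(sample_b, origin))
        best = max(best, float(gap.max()))
    return best


# Synthetic (complexity, vulnerability) pairs; the values are made up.
rng = np.random.default_rng(0)
real_cv = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(500, 2))
gen_close = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(500, 2))
gen_far = rng.normal(loc=[1.5, 1.4], scale=0.3, size=(500, 2))

d_close = ks_2d(real_cv, gen_close)
d_far = ks_2d(real_cv, gen_far)
print(f"distance to similar sample: {d_close:.3f}")
print(f"distance to shifted sample: {d_far:.3f}")
```

Under this sketch, a generated set whose (C, V) pairs are distributed like the real set yields a small distance, and a shifted set yields a larger one. The loop costs O(n^2) in the sample size, which is acceptable for an illustration but slow for large evaluation sets.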

Abstract

With the advancement of generative models, the assessment of generated images becomes increasingly important. Previous methods measure distances between features of reference and generated images extracted by trained vision models. In this paper, we conduct an extensive investigation into the relationship between the representation space and the input space around generated images. We first propose two measures related to the presence of unnatural elements within images: complexity, which indicates how non-linear the representation space is, and vulnerability, which is related to how easily the extracted feature changes under adversarial input perturbations. Based on these, we introduce a new metric for evaluating image-generative models called anomaly score (AS). Moreover, we propose AS-i (anomaly score for individual images), which can effectively evaluate generated images individually. Experimental results demonstrate the validity of the proposed approach.
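
The two measures named in the abstract can be illustrated with simple proxies. The sketch below is an assumption-laden illustration, not the paper's definitions, which this summary does not give. It assumes a toy feature model f(x) = tanh(Wx), a complexity proxy that measures how far features along a straight input path deviate from linear interpolation, and a vulnerability proxy that measures the feature change after a single signed-gradient step from a random start. All names (`make_toy_model`, `complexity_proxy`, `vulnerability_proxy`) and parameter values are hypothetical.

```python
import numpy as np


def make_toy_model(weights, nonlinear=True):
    """Return (features, grad_sq_dist) for the toy map tanh(W x), or W x if linear."""
    def features(x):
        z = weights @ x
        return np.tanh(z) if nonlinear else z

    def grad_sq_dist(x_adv, x_ref):
        # Gradient of ||f(x_adv) - f(x_ref)||^2 with respect to x_adv.
        f_adv = features(x_adv)
        diff = f_adv - features(x_ref)
        slope = 1.0 - f_adv ** 2 if nonlinear else np.ones_like(f_adv)
        return 2.0 * weights.T @ (slope * diff)

    return features, grad_sq_dist


def complexity_proxy(features, x, direction, steps=11):
    """Mean deviation from linear interpolation in feature space, relative to the chord."""
    f_start, f_end = features(x), features(x + direction)
    chord = np.linalg.norm(f_end - f_start)
    deviations = [
        np.linalg.norm(features(x + t * direction) - ((1 - t) * f_start + t * f_end))
        for t in np.linspace(0.0, 1.0, steps)
    ]
    return float(np.mean(deviations) / (chord + 1e-12))


def vulnerability_proxy(features, grad_sq_dist, x, eps, rng):
    """Feature change after one signed-gradient step within an L-infinity budget eps."""
    # The gradient is zero at x itself, so it is evaluated at a random nearby point.
    start = x + rng.uniform(-eps, eps, size=x.shape)
    x_adv = x + eps * np.sign(grad_sq_dist(start, x))
    return float(np.linalg.norm(features(x_adv) - features(x)))


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16)) / 4.0   # toy "feature extractor"
x = rng.normal(size=16)                    # toy "image"
direction = 0.5 * rng.normal(size=16)      # random input-space direction

tanh_features, tanh_grad = make_toy_model(weights, nonlinear=True)
lin_features, lin_grad = make_toy_model(weights, nonlinear=False)

c_tanh = complexity_proxy(tanh_features, x, direction)
c_lin = complexity_proxy(lin_features, x, direction)
v_tanh = vulnerability_proxy(tanh_features, tanh_grad, x, eps=0.05, rng=rng)
print(f"complexity proxy (tanh model):   {c_tanh:.4f}")
print(f"complexity proxy (linear model): {c_lin:.2e}")
print(f"vulnerability proxy (tanh model): {v_tanh:.4f}")
```

For a purely linear feature map the complexity proxy is zero up to floating-point error, which matches the reading of complexity as a measure of non-linearity. A real evaluation would replace the toy model with a trained feature extractor and a stronger attack.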

Paper Structure (24 sections, 7 equations, 16 figures, 7 tables)

Introduction
Related works
Generative models
Evaluation of generative models
Analyzing representation space around generated images
Complexity
Vulnerability
Evaluating generative models
Anomaly score for generative models
Experiments
Evaluating individual generated images
Anomaly score for individual generated images
Subjective test
Conclusion
Employed generative models
...and 9 more sections

Figures (16)

Figure 1: Proposed AS for evaluating generative models and AS-i for individual images. The graph at the top shows that the proposed AS aligns well with human perception when evaluating various generative models trained on the FFHQ dataset. At the bottom, several generated images are shown with our AS-i score, the rarity score [rarity], the realism score [IPR], and the human evaluation. The human-evaluation value is the proportion of participants in our subjective test who judged the image to be natural. For each metric, the best score, marking the image that the metric considers most natural, is highlighted in blue. In terms of naturalness, AS-i shows the best alignment with human evaluation. In contrast, the rarity score rates the second image, which is unnatural, as the most common among real images, and the realism score overestimates the leftmost image as the most realistic.
Figure 2: Tendency of linearity around real data. We compute the change in the linearity of representation spaces developed by ConvNeXt-tiny and DINO-V2 when random noise is added to real images from the ImageNet dataset.
Figure 3: Distribution of complexity. The cumulative distribution function (CDF) of complexity is depicted for various feature models and generative models trained on FFHQ. Each column shows a feature model: ViT-S (left), ConvNeXt-tiny (middle), and DINO-V2 (right). Each row shows a generative model: InsGen [insgen] (top) and StyleNAT [stylenat] (bottom). Note that human evaluation [dgm-eval] rates InsGen as a lower-performing model than StyleNAT.
Figure 4: Image components causing large feature changes under adversarial attack. We partition images into super-pixels and assess each super-pixel's contribution to the feature change caused by the attack. From left to right: the original image, followed by the contribution maps for features extracted by ViT-S [vit], ConvNeXt-tiny [convnext], and DINO-V2 [dinov2]. Red denotes a high contribution to the change, while blue denotes a low contribution.
Figure 5: Distribution of vulnerability. The cumulative distribution function (CDF) of vulnerability is depicted for various feature models and generative models trained on FFHQ. Each column shows a feature model: ViT-S (left), ConvNeXt-tiny (middle), and DINO-V2 (right). Each row shows a generative model: InsGen [insgen] (top) and StyleNAT [stylenat] (bottom). Note that human evaluation [dgm-eval] rates InsGen as a lower-performing model than StyleNAT.
...and 11 more figures
