Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

Joé Napolitano, Pascal Nguyen

Abstract

Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Fréchet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman's rank correlation and Kendall's tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion's VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.
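The core computation described above can be illustrated with a minimal sketch: vectorize the upper-triangular part of a per-image Gram matrix of channel correlations, then compare the anchor and evaluation sets with an RBF-kernel MMD. This is an assumption-laden toy version (NumPy only, random arrays in place of real backbone activations, a biased MMD estimator, and an arbitrary normalization), not the authors' implementation.

```python
import numpy as np

def gram_features(fmap):
    """Vectorize the Gram matrix of one feature map.

    fmap: array of shape (C, H, W), e.g. channel activations from a backbone.
    Returns the upper-triangular part (including the diagonal) of the C x C
    Gram matrix, normalized by the number of spatial positions. The Gram
    matrix is symmetric, so the upper triangle carries all its information.
    """
    C, H, W = fmap.shape
    F = fmap.reshape(C, H * W)
    G = F @ F.T / (H * W)          # C x C channel-correlation matrix
    iu = np.triu_indices(C)        # keep only the upper triangle
    return G[iu]                   # length C * (C + 1) / 2

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared MMD between sample sets X (n, d) and Y (m, d).

    Uses the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) and the
    biased V-statistic estimator, which is always nonnegative.
    """
    def k(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage: random "activations" stand in for anchor (real) and eval sets.
rng = np.random.default_rng(0)
anchor = np.stack([gram_features(rng.normal(size=(8, 4, 4))) for _ in range(16)])
evals = np.stack([gram_features(rng.normal(loc=0.5, size=(8, 4, 4))) for _ in range(16)])
score = rbf_mmd2(anchor, evals, gamma=0.01)  # larger score = larger discrepancy
```

In the paper's setting the kernel bandwidth `gamma` and the choice of backbone layer are the hyperparameters selected via the meta-metric protocol; the values here are placeholders.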

Paper Structure

This paper contains 27 sections, 8 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Overview of the GMMD pipeline. Images from the anchor and evaluation sets are passed through a pretrained backbone. Per-pixel (CNN) or per-patch (Transformer) Gram matrices are vectorized via the upper-triangular part and compared using MMD.
  • Figure 2: Spearman $\rho$ vs. layer index for each backbone, top-7 $\gamma$ values (decreasing opacity) and best configuration ($\bigstar$).
  • Figure 3: Spearman $\rho$ vs. $\gamma$ for the 17 SD-VAE layers. Triangles: $\gamma_{\mathrm{med}}$; circles: empirical optimum.
  • Figure 4: Top-20 configurations ranked by Spearman's $\rho$ (left) and Kendall's $\tau$ (right), coloured by backbone.
  • Figure 5: Distribution of Spearman's $\rho$ across all configurations, grouped by backbone.
  • ...and 11 more figures