Evaluating Generative Models via One-Dimensional Code Distributions

Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou

TL;DR

This work introduces Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences.

Abstract

Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.

Paper Structure (15 sections, 16 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: From feature distributions to token statistics. Conventional metrics such as Fréchet Inception Distance (FID) operate on continuous semantic features and assume a Gaussian distribution in feature space (left), which makes them insensitive to appearance details (e.g., texture, style) and unreliable on non-Gaussian data such as artistic or medical images. Our approach (right) quantizes images into a discrete vocabulary of 1D tokens and compares empirical token statistics directly.
  • Figure 2: Sensitivity of Token Distributions to Image Degradation. To demonstrate how our discrete token space captures perceptual degradations, we apply 10 levels of progressive distortion to a set of 1,000 images and analyze the resulting shifts in their token distributions. As the severity of distortions like Gaussian noise or block shuffling increases (left), a small subset of perceptually-sensitive tokens exhibits consistent and predictable shifts in their distribution (middle). Our Codebook Histogram Distance (CHD) effectively aggregates these subtle changes, showing a robust, monotonic increase with the degradation level across all distortion types (right).
  • Figure 3: Code Mixture Model Degradation. CMMS is trained on token sequences obtained from natural images that are progressively corrupted via uniform token injection, semantic fragment swapping, and pixel-space distortions, without any human labels.
  • Figure 4: Metric–human correlation on VisForm across models and domains. All metrics are normalized to $[0,1]$; higher is better.
  • Figure 5: Mean CHD and FID values versus sample size. CHD converges with roughly 1,000 images, while FID needs over 10,000 samples to stabilize.
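
The captions above describe comparing empirical token statistics and corrupting token sequences by uniform token injection, but they do not give the exact formula for CHD. The sketch below is therefore only an illustration under stated assumptions: it treats each image as a list of integer codebook indices, and it uses total variation distance between normalized token histograms as a stand-in for CHD. The function names (`token_histogram`, `histogram_distance`, `inject_uniform_tokens`) and the toy data are hypothetical and not taken from the paper.

```python
import random


def token_histogram(sequences, codebook_size):
    """Normalized frequency of each codebook index over all sequences.

    Assumes every token is an integer in [0, codebook_size).
    """
    counts = [0] * codebook_size
    for seq in sequences:
        for tok in seq:
            counts[tok] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no tokens to count")
    return [c / total for c in counts]


def histogram_distance(seqs_a, seqs_b, codebook_size):
    """Total variation distance between two token histograms.

    This is an assumed stand-in for CHD, not the paper's definition.
    The value lies in [0, 1]; 0 means identical histograms.
    """
    p = token_histogram(seqs_a, codebook_size)
    q = token_histogram(seqs_b, codebook_size)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def inject_uniform_tokens(seq, rate, codebook_size, rng):
    """Replace each token by a uniformly random code with probability `rate`."""
    return [
        rng.randrange(codebook_size) if rng.random() < rate else tok
        for tok in seq
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    codebook_size = 64
    # Toy "reference" set: sequences that only use the first 8 codes.
    reference = [[rng.randrange(8) for _ in range(32)] for _ in range(200)]
    for rate in (0.0, 0.25, 0.5, 1.0):
        degraded = [
            inject_uniform_tokens(seq, rate, codebook_size, rng)
            for seq in reference
        ]
        d = histogram_distance(reference, degraded, codebook_size)
        print(f"injection rate {rate:.2f}: distance {d:.3f}")
```

Running the script prints the distance at increasing injection rates on the toy data, which gives a rough feel for the kind of degradation sensitivity described in Figure 2. With a real 1D tokenizer, the token lists would come from encoding the reference and generated images instead of from random sampling.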