Fréchet Denoised Distance: Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder

Jiajie Fan, Amal Trigui, Thomas Bäck, Hao Wang

TL;DR

This work designs a novel metric, Fréchet Denoised Distance (FDD), which can effectively detect implausible structures and is more consistent with structural inspections by human experts.

Abstract

Interest in using Deep Generative Models (DGMs) for generative design has grown considerably. When assessing the quality of generated designs, human designers focus more on structural plausibility (e.g., no missing components) than on visual artifacts (e.g., noise or blurriness). Meanwhile, commonly used metrics such as Fréchet Inception Distance (FID) may not evaluate designs accurately, because they are sensitive to visual artifacts and tolerant of semantic errors. As such, FID might not be suitable for assessing the performance of DGMs on a generative design task. In this work, we propose to encode the to-be-evaluated images with a Denoising Autoencoder (DAE) and to measure the distribution distance in the resulting latent space. On this basis, we design a novel metric, Fréchet Denoised Distance (FDD). We experimentally test FDD, FID, and other state-of-the-art metrics on multiple datasets, e.g., BIKED, Seeing3DChairs, FFHQ, and ImageNet. FDD effectively detects implausible structures and is more consistent with structural inspections by human experts. Our source code is publicly available at https://github.com/jiajie96/FDD_pytorch.
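
The abstract describes measuring a distribution distance in the DAE latent space but does not spell out the formula. The following is a minimal sketch, assuming the distance is the standard Fréchet distance between Gaussian fits (as in FID), $d^2 = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, applied to latent codes. The function name `frechet_distance` and the feature arrays are hypothetical, and the eigenvalue-based trace is a NumPy-only substitute chosen for self-containment; it is not taken from the authors' repository.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Squared Frechet distance between Gaussian fits of two feature sets.

    feats_real, feats_gen: arrays of shape (n_samples, n_features), e.g.
    latent codes from an encoder (assumed here to be a DAE encoder).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical error, so the
    # real part is clipped at zero before taking the root.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Under this assumption, swapping the feature extractor (Inception features for FID, DAE latent codes for FDD) changes only the inputs to the function, not the distance computation itself.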

Paper Structure (19 sections, 2 equations, 9 figures, 1 table)

This paper contains 19 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Which side (left or right) shows the more plausible design images? (a) Structural implausibility; (b) visual artifacts. Recent work (Baker et al., 2018; Geirhos et al., 2018; Hermann et al., 2019) finds that SOTA metrics (FID, KID and FD$_\text{DINO-V2}$) tend to penalize visual artifacts more than structural implausibility, although the latter matters more to human designers. In contrast, our FDD is more consistent with human designers and is able to focus on shapes.
  • Figure 2: Plausibility evaluation using Fréchet Denoised Distance. The blue area and stars visualize the distribution and samples of real data in the image space and in the DAE-encoded latent space; the orange area and stars illustrate the distribution and samples of generated data in the image space and in the DAE-encoded latent space.
  • Figure 3: Examples of manipulated images for sensitivity test. We choose the intensity of the disturbances so that images with structural errors (i.e., mask and swap) are notably less plausible than ones with visual artifacts (i.e., salt & pepper noise and Gaussian noise).
  • Figure 4: Sensitivity comparison. The y-axis represents the score measured by each metric, where a lower value indicates a higher similarity to the source data, i.e., better quality. A reliable plausibility metric should penalize structural errors (e.g., mask and swap) more heavily than visual artifacts (e.g., noise). For each metric, the dashed line shows the mean across the groups and the shaded region depicts the values measured in the groups.
  • Figure 5: Experiments with FDD and other DAE-based metrics. (a) Among all DAE-based metrics, FDD shows the best performance; (b) Pearson correlation of the metrics over all distances measured during the sensitivity test with BIKED.
  • ...and 4 more figures
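
The four disturbances named in the Figure 3 caption (salt & pepper noise, Gaussian noise, mask, and swap) could be sketched roughly as follows. This is an illustration only: the function names, the grayscale float image in [0, 1], the zero-filled mask, and all parameters (`amount`, `sigma`, patch positions and sizes) are assumptions, since the captions do not give the settings actually used.

```python
import numpy as np


def salt_and_pepper(img, amount, rng):
    """Visual artifact: set a random fraction `amount` of pixels to 0 or 1."""
    out = img.copy()
    hits = rng.random(img.shape) < amount
    out[hits] = rng.integers(0, 2, size=int(hits.sum())).astype(img.dtype)
    return out


def gaussian_noise(img, sigma, rng):
    """Visual artifact: add zero-mean Gaussian noise, then clip to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)


def mask_patch(img, top, left, size):
    """Structural error: blank out a square patch (assumed fill value 0)."""
    out = img.copy()
    out[top:top + size, left:left + size] = 0.0
    return out


def swap_patches(img, first, second, size):
    """Structural error: exchange two square patches (assumed non-overlapping)."""
    out = img.copy()
    (ta, la), (tb, lb) = first, second
    out[ta:ta + size, la:la + size] = img[tb:tb + size, lb:lb + size]
    out[tb:tb + size, lb:lb + size] = img[ta:ta + size, la:la + size]
    return out
```

In a sensitivity test of this kind, each disturbance would be applied to a set of real images and the metric computed between the original and the disturbed set, so that the scores for structural errors and visual artifacts can be compared.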