Learning to Generate Images with Perceptual Similarity Metrics

Jake Snell, Karl Ridgeway, Renjie Liao, Brett D. Roads, Michael C. Mozer, Richard S. Zemel

TL;DR

This paper addresses the misalignment between pixel-wise losses and human perception in image synthesis. It introduces perceptual losses based on the differentiable multiscale structural-similarity score (MS-SSIM) and evaluates them on deterministic and probabilistic autoencoders. Human observers prefer MS-SSIM reconstructions over pixel-loss reconstructions, and the learned representations improve performance on classification and super-resolution tasks. The results indicate that perceptually grounded objectives preserve fine detail and structure better than traditional pixel losses, suggesting a path toward more faithful image synthesis and more perceptually relevant representations. The approach integrates easily with existing architectures and has practical value for improving image quality in compression, restoration, and upscaling applications.

Abstract

Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM). Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training deterministic and stochastic autoencoders. For three different architectures, we collected human judgments of the quality of image reconstructions. Observers reliably prefer images synthesized by MS-SSIM-optimized models over those synthesized by PL-optimized models, for two distinct PL measures ($\ell_1$ and $\ell_2$ distances). We also explore the effect of training objective on image encoding and analyze conditions under which perceptually-optimized representations yield better performance on image classification. Finally, we demonstrate the superiority of perceptually-optimized networks for super-resolution imaging. Just as computer vision has advanced through the use of convolutional architectures that mimic the structure of the mammalian visual system, we argue that significant additional advances can be made in modeling images through the use of training objectives that are well aligned to characteristics of human perception.
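
To make the contrast between a pixel-wise loss and a structural-similarity score concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it computes a single-scale SSIM from global image statistics instead of local windows at multiple scales, and the helper names (`pixel_losses`, `global_ssim`) and the test images are hypothetical. The constants `k1 = 0.01` and `k2 = 0.03` are the conventional SSIM defaults, assumed here. The example builds two distortions with the same MSE, a uniform brightness shift and per-pixel noise, and shows that the SSIM-style score separates them.

```python
import numpy as np


def pixel_losses(x, y):
    """Return (MSE, MAE) between two images with values in [0, 1]."""
    diff = x - y
    return float(np.mean(diff ** 2)), float(np.mean(np.abs(diff)))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using whole-image statistics (assumption: no
    local windows, single scale), so this is only an illustration."""
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)


rng = np.random.default_rng(0)
# Hypothetical grayscale "image" with values in [0.2, 0.8].
original = 0.2 + 0.6 * rng.random((32, 32))

# Distortion A: uniform brightness shift of +0.1 (structure untouched).
shifted = original + 0.1
# Distortion B: per-pixel noise of +/-0.1 (same squared error per pixel).
noisy = original + 0.1 * rng.choice([-1.0, 1.0], size=original.shape)

mse_shifted, mae_shifted = pixel_losses(original, shifted)
mse_noisy, mae_noisy = pixel_losses(original, noisy)
ssim_shifted = global_ssim(original, shifted)
ssim_noisy = global_ssim(original, noisy)

print("MSE  shifted vs noisy:", mse_shifted, mse_noisy)
print("SSIM shifted vs noisy:", ssim_shifted, ssim_noisy)
```

Both distortions receive the same MSE and MAE, yet the brightness shift scores higher on the SSIM-style measure because it leaves the image structure intact. Because every operation above is a smooth function of the pixel values, a quantity such as `1 - global_ssim(x, y)` could in principle be minimized by gradient descent in an autodiff framework; that is one common way to turn a similarity score into a training loss, stated here as an assumption and not as the authors' exact formulation.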

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Three examples showing reconstructions of an original image (center) by a standard reconstruction approach (left) and our technique (right). The compression factor is high to highlight the differences.
  • Figure 2: Human judgments of reconstructed images. (a) Fully connected network: Proportion of participants preferring SSIM to MSE for each of 100 image triplets. (b) Deterministic conv. network: Distribution of image quality ranking for MS-SSIM, MSE, and MAE for 1000 images from the STL-10 hold-out set.
  • Figure 3: Image triples consisting of, from left to right, the MSE reconstruction, the original image, and the SSIM reconstruction. Image triples are ordered, from top to bottom and left to right, by the percentage of participants preferring SSIM. (a) Eight images for which participants strongly preferred SSIM over MSE. (b) Eight images for which the smallest proportion of participants preferred SSIM.
  • Figure 4: (a) Four randomly selected, held-out STL-10 images and their reconstructions for the 128-hidden-unit networks. For these images, the MS-SSIM reconstruction was ranked as best by humans. (b) Four randomly selected test images where the MS-SSIM reconstruction was ranked second or third.
  • Figure 5: (a) Four randomly selected, held-out STL-10 images and their reconstructions. For these images, the MS-SSIM reconstruction was ranked as best by humans. Reconstructions are from the 128-hidden-unit VAEs. From left to right are the original image, followed by the MS-SSIM, MSE, and MAE reconstructions. (b) Four randomly selected test images where the MS-SSIM reconstruction was ranked second or third.
  • ...and 2 more figures