E-LPIPS: Robust Perceptual Image Similarity via Random Transformation Ensembles

Markus Kettunen, Erik Härkönen, Jaakko Lehtinen

TL;DR

This work exposes brittleness in LPIPS-based perceptual similarity under adversarial perturbations and introduces E-LPIPS, a self-ensembled metric built from random input transformations applied across all CNN layers. The ensemble yields markedly improved robustness against attacks while preserving correlation with human judgments, and reveals perceptual convexity in image space, including barycenters and geodesics that align with intuitive visual transformations. The approach also shows practical benefits when used as a loss function for image restoration tasks. Overall, E-LPIPS advances perceptual imaging by combining robustness, human-aligned judgment, and rich geometric structure without requiring explicit correspondences.

Abstract

It has recently been shown that the hidden variables of convolutional neural networks make for an efficient perceptual similarity metric that accurately predicts human judgment on relative image similarity assessment. First, we show that such learned perceptual similarity metrics (LPIPS) are susceptible to adversarial attacks that dramatically contradict human visual similarity judgment. While this is not surprising in light of neural networks' well-known weakness to adversarial perturbations, we proceed to show that self-ensembling with an infinite family of random transformations of the input (a technique known not to render classification networks robust) is enough to turn the metric robust against attack, while retaining predictive power on human judgments. Finally, we study the geometry imposed by our novel self-ensembled metric (E-LPIPS) on the space of natural images. We find evidence of "perceptual convexity" by showing that convex combinations of similar-looking images retain appearance, and that discrete geodesics yield meaningful frame interpolation and texture morphing, all without explicit correspondences.
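
The self-ensembling idea in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation. It assumes that each sampled transformation is applied identically to both images, that the infinite family is approximated by a finite number of samples, and that a simple gradient-based distance (`base_distance`, a hypothetical stand-in) takes the place of the learned LPIPS network. The transformation set used here (wrap-around shifts, horizontal flips, channel permutations) is likewise chosen for illustration only.

```python
import numpy as np

def base_distance(x, y):
    """Hypothetical stand-in for a learned metric (NOT LPIPS):
    mean squared difference of horizontal and vertical image gradients."""
    gx = np.diff(x, axis=1) - np.diff(y, axis=1)
    gy = np.diff(x, axis=0) - np.diff(y, axis=0)
    return float(np.mean(gx ** 2) + np.mean(gy ** 2))

def sample_transform(rng):
    """Draw one random transformation and return a function that applies it.
    The transformation family is an assumption made for this sketch."""
    dy, dx = (int(v) for v in rng.integers(-4, 5, size=2))  # shift in [-4, 4]
    flip = bool(rng.random() < 0.5)                         # horizontal flip
    perm = rng.permutation(3)                               # channel order

    def apply(img):
        out = np.roll(img, (dy, dx), axis=(0, 1))  # wrap-around translation
        if flip:
            out = out[:, ::-1]
        return out[..., perm]

    return apply

def ensembled_distance(x, y, n_samples=32, seed=0):
    """Average the base distance over random transformations,
    applying the same transformation to both images in each sample."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        t = sample_transform(rng)
        total += base_distance(t(x), t(y))
    return total / n_samples
```

In this sketch an attacker who perturbs one image must fool the metric under many transformations at once rather than under a single fixed view, which is the intuition behind the robustness claim. Using a fresh seed per evaluation would make the estimate stochastic, as one would expect of an ensemble over an infinite family.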

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 2 algorithms.

Figures (5)

  • Figure 1: Attack (A1). Both LPIPS metrics allow the image to be pulled far away from the source towards the target while remaining at the same LPIPS distance from the source as the anchor. With the same constraint, attacks on E-LPIPS lie much closer to the source image both visually and by relative $L_2$ distance.
  • Figure 2: Attack (A2). Both LPIPS metrics allow the image to be pushed far away in distance by modifications that are small both visually and in the $L_2$ sense. In contrast, the attack is unable to increase the E-LPIPS distance nearly as much; furthermore, the visual change is much more clearly visible at the same $L_2$ distance, which is desirable. The relative distance reported below the images is normalized such that 1 is the mean distance between the different images in the dataset.
  • Figure 3: Success of Attack (A1) against increasingly powerful variants of E-LPIPS. The increasing robustness resulting from a richer transformation ensemble is visible as the increasing visual similarity between (a) and attack results (c)-(h). Image (h) corresponds to the full E-LPIPS metric.
  • Figure 4: Barycenters of similar-looking images $x_1, x_2, \cdots, x_{10}$ under various distance metrics. Top row: 10 noise realizations of the same image. Bottom row: small translations of the same image. The $L_2$ barycenter simply averages the inputs. Unlike $L_2$ and LPIPS, the E-LPIPS barycenter retains much of the appearance of the input images in both cases.
  • Figure 5: Discrete geodesics between Images A and B, computed in three metrics. Each column shows a single frame from the discrete geodesic, as well as the time evolution of the scanline indicated in red. The reader is encouraged to view the supplemental animations.
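
The Figure 4 caption notes that the $L_2$ barycenter simply averages the inputs. The sketch below checks this numerically, assuming the barycenter is defined as the image minimizing the mean squared distance to all inputs (a common definition, taken here as an assumption rather than from the text above). The function name and the step-size settings are illustrative. Under that assumption, swapping the squared $L_2$ distance for another differentiable metric would give the corresponding barycenter for that metric.

```python
import numpy as np

def l2_barycenter_gd(images, steps=200, lr=0.1):
    """Minimize mean_i ||b - x_i||^2 over b by plain gradient descent,
    starting from the first input image."""
    xs = np.stack(images).astype(np.float64)
    b = xs[0].copy()
    for _ in range(steps):
        # Gradient of the mean squared L2 distance with respect to b.
        grad = 2.0 * np.mean(b - xs, axis=0)
        b -= lr * grad
    return b

# Ten noisy realizations of the same (random) image, as in the top row of Figure 4.
rng = np.random.default_rng(0)
clean = rng.random((8, 8, 3))
noisy = [clean + 0.1 * rng.standard_normal(clean.shape) for _ in range(10)]

barycenter = l2_barycenter_gd(noisy)
average = np.mean(np.stack(noisy), axis=0)
```

Here `barycenter` and `average` agree to numerical precision, which is why the $L_2$ barycenter of noisy or slightly translated inputs looks like a blurred or denoised average rather than like any single input.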