Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images

Paula Seidler, Neill D. F. Campbell, Ivor J A Simpson

TL;DR

SUSS is a probabilistic perceptual similarity score built from multiple image components, each modeled as a structured, image-specific multivariate normal distribution. Each component's distribution is learned from self-supervised augmentations using a Structured Uncertainty Prediction Network (SUPN) with a sparse Cholesky factor, which enables interpretable residual whitening and sampling. The final score is a weighted sum of component log-likelihoods, with weights learned from human judgments through a 2AFC framework. The resulting score shows strong perceptual calibration across distortions and provides local explanations via SUSS maps. Empirically, SUSS outperforms traditional metrics and rivals deep perceptual losses on standard benchmarks, while offering stable optimization and clear interpretability, making it practical as a differentiable perceptual loss for imaging tasks.
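To make the pieces above concrete, here is a minimal NumPy sketch of a single component's log-likelihood and of the weighted sum over components. It assumes the precision (inverse covariance) matrix factorizes as $L L^\top$ with $L$ lower-triangular and positive on the diagonal, so that the whitened residual is $L^\top r$. A small dense matrix stands in for the sparse factor a SUPN would predict, and the function names `component_log_prob` and `weighted_score` are hypothetical, not taken from the paper.

```python
import numpy as np

def component_log_prob(x, mean, L):
    """Log-density of N(mean, (L L^T)^{-1}) at x, plus the whitened residual.

    Assumption: L is a lower-triangular Cholesky factor of the precision
    matrix with a positive diagonal. A dense array is used here for
    illustration; the text describes a sparse factor.
    """
    r = x - mean                       # residual in pixel space
    w = L.T @ r                        # whitened (decorrelated) residual
    d = r.size
    # log det(L L^T) = 2 * sum(log(diag(L))) for triangular L
    log_det_precision = 2.0 * np.sum(np.log(np.diag(L)))
    log_p = (-0.5 * (w @ w)
             + 0.5 * log_det_precision
             - 0.5 * d * np.log(2.0 * np.pi))
    return log_p, w

def weighted_score(component_log_probs, weights):
    """Weighted sum of per-component log-probabilities."""
    return float(np.dot(weights, component_log_probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4                                   # toy "image" with 4 pixels
    L = np.tril(0.1 * rng.normal(size=(d, d)))
    L[np.diag_indices(d)] = 1.0 + rng.random(d)
    reference = rng.normal(size=d)
    candidate = reference + 0.05 * rng.normal(size=d)
    log_p, whitened = component_log_prob(candidate, reference, L)
    print("log-probability:", log_p)
    print("whitened residual:", whitened)
```

The squared entries of the whitened residual are the per-pixel contributions to the quadratic term, which is what makes a pixelwise map of the score possible in this formulation.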

Abstract

Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS); it models each image through a set of perceptual components, each represented by a structured multivariate Normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
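The abstract states that the component weights are learned from human perceptual datasets but does not give the objective. One plausible way to set this up, shown purely as an assumption, is a logistic model on two-alternative forced-choice (2AFC) data: for a reference and two candidates, the probability that humans judge candidate 1 closer is modeled as a sigmoid of the weighted difference in component log-probabilities. The function `fit_2afc_weights` and its parameters are hypothetical.

```python
import numpy as np

def fit_2afc_weights(lp0, lp1, pref1, lr=0.5, steps=2000):
    """Fit component weights from 2AFC judgments by gradient descent.

    lp0, lp1 : (n, k) arrays of component log-probabilities for
               candidates 0 and 1 under each reference image.
    pref1    : (n,) fraction of annotators who judged candidate 1 closer.
    Assumed model: P(candidate 1 closer) = sigmoid(w . (lp1 - lp0)),
    trained with cross-entropy. This is an illustrative choice only.
    """
    diff = lp1 - lp0
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        grad = diff.T @ (p - pref1) / len(pref1)   # cross-entropy gradient
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    lp0 = rng.normal(size=(300, 3))
    lp1 = lp0 + rng.normal(size=(300, 3))
    # Synthetic "human" preferences generated from known weights.
    true_w = np.array([1.0, 0.5, 0.25])
    pref1 = 1.0 / (1.0 + np.exp(-((lp1 - lp0) @ true_w)))
    print("recovered weights:", fit_2afc_weights(lp0, lp1, pref1))
```

Because the objective is convex in the weights, plain gradient descent suffices for a toy problem of this size; a real training setup would likely use a standard optimizer and held-out validation.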

Paper Structure

This paper contains 19 sections, 5 equations, 22 figures, 5 tables.

Figures (22)

  • Figure 1: For pairs of images (top row), humans can determine the level of similarity and offer post-hoc explanations. Deep learned perceptual metrics, such as LPIPS (third row), use features that are a complex non-linear transformation of the images and contain many invariances, making the rationale for a result hard to explain. This is illustrated by examining the magnitude of the difference of feature maps for the two example image pairs: we observe diffuse minor differences between two dissimilar images, and highly localized distinctions between closely related images. Our proposed method, SUSS (bottom row), provides an interpretable pixelwise map based on linearly transformed residuals between the two images. This map directly highlights only the differences that are considered perceivable, reducing the impact of minor alterations.
  • Figure 2: SUSS evaluates the similarity between images by separately constructing distributions for multi-scale structural (y) and color (cb, cr) similarity. Our CNN, SUPN UNet, predicts a distribution of perceptually close augmented image data, $\tilde{X}$, as a multivariate normal distribution with a structured, Cholesky-factored inverse covariance. This model is trained in a self-supervised fashion. SUSS is then calculated as a weighted sum of log probabilities, with weights derived from a training set using human annotations.
  • Figure 3: Inspection of perceptually relevant pixel structures learned by SUSS. (a–b) show small and medium transformations. The residual is the difference of feature maps, and the whitened residual $L(X)^\top R$ is a linear transform of this residual, as learned by SUSS. The whitened residual highlights regions of structure that are considered perceptually distant, suppressing perceptually irrelevant noise. (c) illustrates the summative SUSS map.
  • Figure 4: How well does SUSS navigate the optimization landscape as a perceptual loss objective? We perform a reconstruction task where the goal is to recover the original image (left) from distorted inputs. Using LPIPS and SUSS as loss objectives, with optimal optimization settings, we minimize the perceptual score between the reference and the input and visualise the results. Both losses converge to high similarity scores (matching MOS levels associated with imperceptible differences in KADID10K).
  • Figure 5: Samples from the learned distribution (SUSSPieApp-RH model), showing the original image and close, medium, and far samples (left to right) for different components of the SUSS model; RGB shows CbCr samples combined with the original Y. Samples are drawn from the learned SUPN distributions; we select the minimum, median, and maximum probabilities from 1000 samples.
  • ...and 17 more figures
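
The sampling procedure described in the Figure 5 caption can be sketched as follows, assuming the precision matrix factorizes as $L L^\top$ so that a sample is obtained by solving $L^\top \epsilon = z$ for standard normal $z$. The dense solve stands in for a sparse triangular solve, the helper name `sample_and_rank` is hypothetical, and the "medium" pick is the upper median when the sample count is even.

```python
import numpy as np

def sample_and_rank(mean, L, n_samples=1000, seed=0):
    """Draw samples from N(mean, (L L^T)^{-1}) and pick far/medium/close ones.

    Assumption: L is a lower-triangular Cholesky factor of the precision
    matrix with a positive diagonal (dense here for illustration).
    """
    rng = np.random.default_rng(seed)
    d = mean.size
    z = rng.standard_normal((d, n_samples))
    # Solving L^T eps = z gives eps with covariance (L L^T)^{-1}.
    eps = np.linalg.solve(L.T, z)
    samples = mean[:, None] + eps
    # The whitened residual of each sample is exactly z, so the
    # log-probability only needs the squared norm of z.
    log_det_precision = 2.0 * np.sum(np.log(np.diag(L)))
    log_probs = (-0.5 * np.sum(z ** 2, axis=0)
                 + 0.5 * log_det_precision
                 - 0.5 * d * np.log(2.0 * np.pi))
    order = np.argsort(log_probs)           # ascending log-probability
    picks = {
        "far": int(order[0]),                # minimum probability
        "medium": int(order[n_samples // 2]),
        "close": int(order[-1]),             # maximum probability
    }
    return samples.T, log_probs, picks

if __name__ == "__main__":
    d = 6
    rng = np.random.default_rng(42)
    L = np.tril(0.2 * rng.normal(size=(d, d)))
    L[np.diag_indices(d)] = 1.0 + rng.random(d)
    samples, log_probs, picks = sample_and_rank(np.zeros(d), L)
    for name, idx in picks.items():
        print(name, log_probs[idx])
```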