Structured Uncertainty Similarity Score (SUSS): Learning a Probabilistic, Interpretable, Perceptual Metric Between Images
Paula Seidler, Neill D. F. Campbell, Ivor J A Simpson
TL;DR
SUSS is a probabilistic perceptual similarity score built from multiple image components, each modeled as a structured, image-specific multivariate normal distribution. Each component's distribution is learned from self-supervised augmentations using a Structured Uncertainty Prediction Network (SUPN) with a sparse Cholesky factor, which enables interpretable residual whitening and sampling. The final score is a weighted sum of component log-likelihoods, with weights fitted to human judgments in a two-alternative forced choice (2AFC) framework. The score is perceptually well calibrated across distortion types and provides local explanations via SUSS maps. Empirically, SUSS outperforms traditional metrics and rivals deep perceptual losses on standard benchmarks, while offering stable optimization and clear interpretability, making it practical as a differentiable perceptual loss for imaging tasks.
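To make the scoring rule concrete, here is a minimal NumPy sketch of a weighted sum of Gaussian log-likelihoods. It is an illustration under stated assumptions, not the authors' implementation: it assumes the Cholesky factor L parameterizes the precision matrix (so the covariance is (L Lᵀ)⁻¹), it uses small dense matrices where a SUPN would predict a sparse factor, and the names `gaussian_log_prob` and `suss_like_score` are hypothetical.

```python
import numpy as np

def gaussian_log_prob(y, mu, chol_prec):
    """Log-density of N(mu, (L L^T)^-1) at y, where L = chol_prec is lower-triangular.

    ASSUMPTION: L is the Cholesky factor of the precision matrix.
    Returns the log-density and the whitened residual z = L^T (y - mu).
    """
    r = y - mu
    z = chol_prec.T @ r                                  # whitened (decorrelated) residual
    d = r.size
    half_log_det_prec = np.sum(np.log(np.diag(chol_prec)))  # 0.5 * log|L L^T|
    log_p = -0.5 * (z @ z) + half_log_det_prec - 0.5 * d * np.log(2.0 * np.pi)
    return log_p, z

def suss_like_score(y, components, weights):
    """Weighted sum of component log-likelihoods.

    components: list of (mu_k, L_k) pairs, hypothetically predicted from the reference image.
    weights:    1-D array with one weight per component (in SUSS, learned from human judgments).
    """
    log_probs = np.array([gaussian_log_prob(y, mu, L)[0] for mu, L in components])
    return float(weights @ log_probs)
```

The whitened residual `z` returned alongside the log-density is the quantity one would inspect to see which residual directions the model penalizes most.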
Abstract
Perceptual similarity scores that align with human vision are critical for both training and evaluating computer vision models. Deep perceptual losses, such as LPIPS, achieve good alignment but rely on complex, highly non-linear discriminative features with unknown invariances, while hand-crafted measures like SSIM are interpretable but miss key perceptual properties. We introduce the Structured Uncertainty Similarity Score (SUSS), which models each image through a set of perceptual components, each represented by a structured multivariate normal distribution. These are trained in a generative, self-supervised manner to assign high likelihood to human-imperceptible augmentations. The final score is a weighted sum of component log-probabilities with weights learned from human perceptual datasets. Unlike feature-based methods, SUSS learns image-specific linear transformations of residuals in pixel space, enabling transparent inspection through decorrelated residuals and sampling. SUSS aligns closely with human perceptual judgments, shows strong perceptual calibration across diverse distortion types, and provides localized, interpretable explanations of its similarity assessments. We further demonstrate stable optimization behavior and competitive performance when using SUSS as a perceptual loss for downstream imaging tasks.
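The "decorrelated residuals and sampling" mentioned above can be illustrated with a short NumPy sketch. It assumes, as a labeled simplification, that each component's Gaussian is parameterized by a lower-triangular Cholesky factor L of the precision matrix, and it uses a small dense L rather than a sparse one. The function names are hypothetical. Sampling solves Lᵀx = ε for standard normal ε, and whitening applies Lᵀ to the residual, so the two operations are inverses of each other.

```python
import numpy as np

def sample_from_precision_cholesky(mu, chol_prec, n_samples, rng):
    """Draw samples from N(mu, (L L^T)^-1); returns an array of shape (d, n_samples).

    ASSUMPTION: chol_prec (L) is the lower-triangular Cholesky factor of the precision.
    """
    d = mu.size
    eps = rng.standard_normal((d, n_samples))
    # Solving L^T x = eps gives cov(x) = L^-T L^-1 = (L L^T)^-1.
    x = np.linalg.solve(chol_prec.T, eps)
    return mu[:, None] + x

def whiten_residuals(samples, mu, chol_prec):
    """Map residuals (samples - mu) to decorrelated, unit-variance coordinates."""
    return chol_prec.T @ (samples - mu[:, None])
```

If the model fits the data, whitened residuals should have approximately identity covariance; structure remaining in them points to variation the component does not explain.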
