Enhancing image quality prediction with self-supervised visual masking

Uğur Çoğalan, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski

TL;DR

This work tackles the misalignment between full-reference image quality metrics (FR-IQMs) and human perception by introducing a self-supervised visual masking mechanism. A lightweight CNN predicts a per-pixel mask $M$ from a reference image $X$ and a distorted image $Y$. The mask modulates both inputs before an existing FR-IQM $\mathcal{D}$ is applied, and a small mapping network $\mathcal{G}$ aligns the metric output with the mean opinion score (MOS) $q$. The approach improves both classical and deep-feature IQMs on CSIQ, TID2013, and PIPAL, produces error maps that better reflect distortion visibility, and, when used as a loss, improves denoising and deblurring results. Because the method is lightweight and agnostic to the underlying metric, it is a candidate for use in restoration and compression workflows.
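To make the masking idea concrete, the following is a minimal sketch of a mask-modulated MAE in NumPy. It rests on several assumptions: `toy_mask` is a hypothetical, hand-written stand-in for the learned mask predictor (it is not the paper's CNN), the mask is applied by element-wise multiplication, images are single-channel arrays in $[0, 1]$, and the mapping network $\mathcal{G}$ and the training on MOS are omitted entirely.

```python
import numpy as np


def mae(x, y):
    """Plain mean absolute error between two images of equal shape."""
    return float(np.mean(np.abs(x - y)))


def toy_mask(x, y, k=4.0):
    """Hypothetical stand-in for the learned mask predictor (assumption).

    It down-weights pixels where the reference has strong local gradients,
    a crude proxy for contrast masking. The distorted image y is accepted
    only to mirror the (X, Y) -> M interface and is unused here.
    """
    gy, gx = np.gradient(x)                # finite differences along rows, columns
    activity = np.sqrt(gx ** 2 + gy ** 2)  # local gradient magnitude
    return 1.0 / (1.0 + k * activity)      # values in (0, 1]


def enhanced_mae(x, y, mask_fn=toy_mask):
    """MAE computed on mask-modulated inputs: D(M * X, M * Y)."""
    m = mask_fn(x, y)
    return mae(m * x, m * y)


# Usage on synthetic data: a random reference and a noisy copy of it.
rng = np.random.default_rng(0)
x = rng.random((32, 32))
y = np.clip(x + 0.05 * rng.standard_normal(x.shape), 0.0, 1.0)
print("MAE  :", mae(x, y))
print("E-MAE:", enhanced_mae(x, y))
```

Because the same mask multiplies both inputs, the masked error at each pixel is the mask value times the original absolute error. A mask with values in $(0, 1]$ can therefore only attenuate errors, and an all-ones mask recovers the plain metric.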

Abstract

Full-reference image quality metrics (FR-IQMs) aim to measure the visual differences between a pair of reference and distorted images, with the goal of accurately predicting human judgments. However, existing FR-IQMs, including traditional ones like PSNR and SSIM and even perceptual ones such as HDR-VDP, LPIPS, and DISTS, still fall short in capturing the complexities and nuances of human perception. In this work, rather than devising a novel IQM model, we seek to improve upon the perceptual quality of existing FR-IQM methods. We achieve this by considering visual masking, an important characteristic of the human visual system that changes its sensitivity to distortions as a function of local image content. Specifically, for a given FR-IQM metric, we propose to predict a visual masking model that modulates reference and distorted images in a way that penalizes the visual errors based on their visibility. Since the ground truth visual masks are difficult to obtain, we demonstrate how they can be derived in a self-supervised manner solely based on mean opinion scores (MOS) collected from an FR-IQM dataset. Our approach results in enhanced FR-IQM metrics that are more in line with human prediction both visually and quantitatively.
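Using the notation of the TL;DR, and writing $\mathcal{F}$ for the mask predictor as in Figure 2, the training setup described in the abstract can be sketched as follows. This is a hedged reading rather than the paper's exact equations: the element-wise product $\odot$ as the modulation operator and the squared-error loss are assumptions.

$$
M = \mathcal{F}(X, Y), \qquad
\hat{q} = \mathcal{G}\big(\mathcal{D}(M \odot X,\; M \odot Y)\big), \qquad
\mathcal{L} = \lVert \hat{q} - q \rVert_2^2
$$

No ground-truth mask appears in $\mathcal{L}$. The only supervision is the MOS $q$, which is why the masks are described as self-supervised.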

Paper Structure (10 sections, 2 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: Agreement of metric predictions with human judgments. We consider classic (MAE and SSIM) and learning-based (LPIPS and DISTS) metrics and compare their predictions with those of their enhanced versions (E-MAE, E-SSIM, E-LPIPS, and E-DISTS) obtained with our approach. On the left, MAE and SSIM favor JPEG-like artifacts over slightly resampled textures. On the right, LPIPS and DISTS prefer blur over a subtle color shift. Our enhanced metric versions are better aligned with the human choice. The images have been extracted from the PIPAL dataset [gu2020pipal].
  • Figure 2: Our proposed visual masking for enhancing classic metrics such as MAE and SSIM (left) and learning-based metrics such as DISTS or LPIPS (right). For classic metrics, the inputs to our mask predictor network $\mathcal{F}$ are sRGB images, while for learning-based metrics, the inputs are the VGG features extracted from the images. We learn the visual masks in a self-supervised fashion by minimizing the difference between the metric's final score and human scores collected from an FR-IQM dataset.
  • Figure 3: Visual comparisons of distortion visibility maps for Gaussian noise (upper row) and super-resolution artifacts (middle and bottom rows). The distortion examples are taken from the PIPAL dataset. The first two columns present the reference and distorted images, followed by the respective metric predictions: MAE, HDR-VDP-2 [mantiuk2011hdr], LocVis [wolski2018dataset], FovVideoVDP [mantiuk2021fovvideovdp], and our E-MAE. Here, we additionally visualize the MAE map to better understand the characteristics of each distortion. As can be seen, the existing metrics tend to either underestimate or overestimate the distortion visibility. Note that LocVis and E-MAE have not seen distorted images with super-resolution artifacts during training.
  • Figure 4: Comparison of our E-MAE metric masks for the noise (fifth row) and blur (sixth row) distortions at different image contrast levels ($\times 0.5$, $\times 1$, and $\times 2$). In the fourth row, we also show a map of human sensitivity to local contrast changes as predicted by a traditional model of visual contrast masking [tursun2019luminance]. In all cases, darker means more masking (less sensitivity to distortion).
  • Figure 5: Visualization of the predicted masks across different metrics for a given pair of reference and distorted images with Gaussian noise from the TID dataset. Note that the SSIM values have been remapped to 1 - SSIM, so that lower values indicate less visible errors. For PSNR, we show the error map of the measured MSE. For the VGG metric, we visualize the predicted masks for all layers, while the error map is shown only for the first layer.
  • ...and 5 more figures