Do image and video quality metrics model low-level human vision?

Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk

TL;DR

The paper presents a framework to evaluate whether image and video quality metrics reflect low-level human vision by deploying 11 psychophysically inspired tests that probe contrast sensitivity, contrast masking, flicker, and supra-threshold contrast matching. By evaluating 33 metrics across these tests, it reveals that some perceptual metrics (e.g., ColorVideoVDP, HDR-VDP-3) align with low-level vision in several dimensions, while popular metrics like SSIM can overemphasize high frequencies and VMAF may underperform in masking tasks. The framework uses contour plots and alignment/RMSE measures to diagnose strengths and weaknesses of each metric, offering a diagnostic tool to improve perceptual alignment beyond subjective MOS correlations. The results highlight that deep-learning-based metrics can capture masking characteristics even without explicit training on such data, yet no single metric fully captures all low-level vision phenomena, underscoring the need for targeted evaluation when selecting or designing perceptual quality metrics. Overall, the work advances understanding of how current metrics relate to fundamental visual mechanisms and provides a practical methodology for improving perceptual fidelity in quality assessment tools.

Abstract

Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict the perceived quality of the evaluated content and are often claimed to be "perceptual". Yet few metrics directly model human visual perception, and most rely on hand-crafted formulas or training datasets to achieve alignment with perceptual data. In this paper, we propose a set of tests for full-reference quality metrics that examine their ability to model several aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. The tests are meant to provide additional scrutiny for newly proposed metrics. We use our tests to analyze 33 existing image and video quality metrics and identify their strengths and weaknesses, such as the ability of LPIPS and MS-SSIM to predict contrast masking and the poor performance of VMAF in this task. We further find that the popular SSIM metric overemphasizes differences in high spatial frequencies, but its multi-scale counterpart, MS-SSIM, addresses this shortcoming. Such findings cannot be easily made using existing evaluation protocols.

Paper Structure

This paper contains 27 sections, 12 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: A representation of the methodologies employed to evaluate the metrics on the various tests (each in a column). The first row represents a grid of the test images. The second row represents the reference image(s) for each test. For tests where the reference image was a uniform field, we showcase only one image. The third row showcases the metric results on the specific test, for example, a contour plot for the detection test, as well as the performance score (an alignment score (AS) or RMSE).
  • Figure 2: Metric predictions compared to the human data for selected quality metrics. Each row represents a test, and each column corresponds to a different metric. The detection tests are in rows (a)--(g), and the alignment score ($\uparrow$) for each is reported in the square brackets. The red lines show the human performance based on castleCSF for (a)--(d), measurements for (e)--(f), and elaTCSF for (g). The contrast matching tests are reported in rows (h) and (i) with the RMSE ($\downarrow$) in the square brackets. The human performance is shown as dashed lines in (h). The perfect alignment for matching across color directions (i) should result in horizontal lines.
  • Figure S1: A representation of the test images used in the contrast detection across spatial frequencies test. We showcase the achromatic Gabor patches (test images) across spatial frequency (x-axis) and contrast (y-axis). In the test, we employed spatial frequencies up to 32 cpd (cycles per visual degree); however, we show them up to 16 cpd in the figure, as higher-frequency patterns may introduce aliasing artifacts on screen or print (these aliasing artifacts were not present in the test).
  • Figure S2: A representation of the test images used in the contrast detection across spatial frequencies test. We showcase the red-green (RG) Gabor patches (test images) across spatial frequency (x-axis) and contrast (y-axis).
  • Figure S3: A representation of the test images used in the contrast detection across spatial frequencies test. We showcase the yellow-violet (YV) Gabor patches (test images) across spatial frequency (x-axis) and contrast (y-axis).
  • ...and 6 more figures
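
The detection stimuli described in the captions above (an achromatic Gabor patch compared against a uniform-field reference) can be illustrated with a minimal sketch. This is not the authors' stimulus code: the function name `gabor_patch`, the display resolution `ppd` (pixels per visual degree), the Gaussian envelope width `sigma_deg`, and the normalized mean luminance `mean_lum` are all assumptions chosen for illustration, and the paper's actual stimulus parameters are not given in this summary.

```python
import numpy as np


def gabor_patch(size_px, ppd, freq_cpd, contrast, sigma_deg=0.5, mean_lum=1.0):
    """Return (test, reference) luminance images for a detection-style test.

    All parameter names and defaults are hypothetical, not taken from the paper.
    freq_cpd should stay below ppd / 2 (the Nyquist limit) to avoid aliasing.
    """
    # Pixel coordinates in visual degrees, centred on the patch
    half = (size_px - 1) / 2.0
    coords = (np.arange(size_px) - half) / ppd
    x, y = np.meshgrid(coords, coords)

    # Gaussian envelope multiplied by a cosine carrier along x
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma_deg**2))
    carrier = np.cos(2.0 * np.pi * freq_cpd * x)

    # Reference is a uniform field; test adds the Gabor modulation to it
    reference = np.full((size_px, size_px), mean_lum)
    test = mean_lum * (1.0 + contrast * envelope * carrier)
    return test, reference


# Example: a 4 cpd patch at 10% contrast on a hypothetical 64 ppd display
test, reference = gabor_patch(size_px=65, ppd=64, freq_cpd=4.0, contrast=0.1)
```

Sweeping `freq_cpd` and `contrast` over a grid and passing each `(test, reference)` pair to a full-reference metric would give the kind of frequency-by-contrast response map that the captions describe as a contour plot.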