ColorVideoVDP: A visual difference predictor for image, video and display distortions

Rafal K. Mantiuk, Param Hanji, Maliha Ashraf, Yuta Asano, Alexandre Chapiro

TL;DR

ColorVideoVDP introduces a fully differentiable, full-reference metric that jointly models color and spatiotemporal vision with a display-aware pipeline. Built on castleCSF and cross-channel masking in the Derrington-Krauskopf-Lennie space, it predicts Just-Objectionable-Differences (JOD) while producing per-channel distortion visualizations. Calibrated with XR-DAVID and existing SDR/HDR datasets, it demonstrates improved prediction accuracy across diverse content and XR artifacts, and supports applications in chroma subsampling, display-tolerance specifications, and perceptual optimization. The method fills a gap in video/image quality assessment by integrating color, temporal dynamics, and display characteristics into a single, interpretable, and differentiable framework, enabling perceptually guided design and optimization in modern displays and XR systems.

Abstract

ColorVideoVDP is a video and image quality metric that models spatial and temporal aspects of vision, for both luminance and color. The metric is built on novel psychophysical models of chromatic spatiotemporal contrast sensitivity and cross-channel contrast masking. It accounts for the viewing conditions and for the geometric and photometric characteristics of the display. It was trained to predict common video streaming distortions (e.g., video compression, rescaling, and transmission errors), and also 8 new distortion types related to AR/VR displays (e.g., light source and waveguide non-uniformities). To address the latter application, we collected our novel XR-Display-Artifact-Video quality dataset (XR-DAVID), comprising 336 distorted videos. Extensive testing on XR-DAVID, as well as on several datasets from the literature, indicates a significant gain in prediction performance compared to existing metrics. ColorVideoVDP opens the door to many novel applications that require the joint automated spatiotemporal assessment of luminance and color distortions, including video streaming, display specification and design, visual comparison of results, and perceptually-guided quality optimization.

Paper Structure (50 sections, 19 equations, 24 figures, 3 tables)

Figures (24)

  • Figure 1: Processing stages of ColorVideoVDP. Test and reference images are first processed by the same pipeline: the display model maps pixel values to linear color (CIE 1931 XYZ color space), linear color is transformed to the opponent color space (DKL), the achromatic channel is decomposed into sustained (continuous lines) and transient (dashed lines) temporal channels, then each of those is decomposed into multiple spatial bands (Laplacian pyramid). The decomposed video/image goes into the contrast sensitivity and masking models, explained in more detail in Figure 5. The result of the masking model is pooled across all spatial bands, temporal and color channels, and finally regressed to a JOD score.
  • Figure 2: The frequency characteristic of the four temporal channels used in ColorVideoVDP.
  • Figure 3: castleCSF contrast sensitivity function for the four channels of ColorVideoVDP. The sensitivity is expressed in cone-contrast units (Wuerger et al., 2020). Note that the achromatic transient and chromatic RG channels appear to have higher sensitivity than the achromatic sustained channel. This is due to the scaling used in the DKL color space (Derrington et al., 1984) to represent chromatic contrast units. In practice, the transient and two chromatic channels are much less sensitive to patterns found in complex images.
  • Figure 4: Example of contrast masking. A sinusoidal grating of 4 cpd (when seen printed from 40 cm) was added to a reference image (top row) to obtain a distorted image (second row). The image was split into three parts and the grating was modulated along achromatic, red-green, and yellow-violet directions in each respective part. The third row shows the visual difference map generated by ColorVideoVDP without masking but still using the CSF. The map over-predicts the visibility in textured areas. The bottom row shows the prediction with the masking model. Note that although all three color directions are equally visible in the gray bar at the bottom of the image (where there is little masking), the red-green pattern is more visible in the textured area because it is weakly masked by the achromatic channel. ColorVideoVDP correctly predicts this phenomenon.
  • Figure 5: Our masking model. Here, the resulting visual difference for the sustained channel, $D_{b,\mathcal{S}}$, is visualized for a single spatial frequency band (the frame index $f$ is omitted for clarity). Each band response is multiplied by the contrast sensitivity function ("S" boxes). The CSF-normalized band responses are used to calculate the difference between test and reference frames. The masking signal is computed by first finding mutual masking between both channels ("MM" blocks), applying a spatial pooling in the local neighborhood of each pixel ("SP" blocks) and then combining the masking signal from multiple channels (cross-channel masking). The visual per-channel and per-band difference between the test and reference is calculated as the ratio of the excitatory difference between the test and reference images to the inhibitory masking signal, as shown in the equation in the box and in the paper's masking-model equation.
  • ...and 19 more figures
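
The masking stage described in the Figure 5 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the ColorVideoVDP implementation. The function names (`box_pool`, `masked_difference`), the use of `min(|T|, |R|)` as the mutual-masking term, the box-filter spatial pooling, the exponents `p` and `q`, the `1 +` term in the denominator, and the cross-channel weights are all hypothetical choices made for illustration. A single scalar sensitivity per channel stands in for the contrast sensitivity function.

```python
import numpy as np


def box_pool(x, radius=1):
    """Mean over a (2*radius+1)^2 neighborhood, with edge replication.

    Stand-in for the "SP" (spatial pooling) blocks; the box shape is an assumption.
    """
    k = 2 * radius + 1
    padded = np.pad(x, radius, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)


def masked_difference(test, ref, sensitivity, xchannel_weights, target,
                      p=2.0, q=2.0):
    """Per-pixel visual difference for one band of the `target` channel.

    test, ref:         dict channel name -> 2D array of band contrast
    sensitivity:       dict channel name -> scalar (placeholder for the CSF)
    xchannel_weights:  dict channel name -> weight of that channel's masking
                       signal on `target` (hypothetical values)
    p, q:              excitation and masking exponents (assumed)
    """
    # "S" boxes: normalize each band response by contrast sensitivity.
    t = {c: test[c] * sensitivity[c] for c in test}
    r = {c: ref[c] * sensitivity[c] for c in ref}

    # Excitatory signal: difference between test and reference.
    excitation = np.abs(t[target] - r[target]) ** p

    # Inhibitory signal: mutual masking ("MM"), spatial pooling ("SP"),
    # then a weighted sum over channels (cross-channel masking).
    masking = np.zeros_like(excitation)
    for c in t:
        mutual = np.minimum(np.abs(t[c]), np.abs(r[c]))
        masking += xchannel_weights[c] * box_pool(mutual ** q)

    # Ratio of excitation to inhibition; "1 +" keeps the unmasked case finite.
    return excitation / (1.0 + masking)
```

With a flat reference (zero contrast in every channel) the masking term vanishes and the output equals the CSF-weighted difference raised to `p`. The same distortion added on top of a high-contrast reference yields a smaller value, which mirrors the behavior shown in Figure 4.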