What Makes for a Good Stereoscopic Image?

Netanel Y. Tamir, Shir Amir, Ranel Itzhaky, Noam Atia, Shobhita Sundaram, Stephanie Fu, Ron Sokolovsky, Phillip Isola, Tali Dekel, Richard Zhang, Miriam Farber

TL;DR

This work addresses the lack of holistic, VR-specific SQoE evaluation for stereoscopic content by introducing the SCOPE dataset and the iSQoE predictor. SCOPE comprises 2400 stereo-image samples with diverse distortions and 2AFC human annotations gathered on VR headsets, enabling training of a holistic SQoE model. The iSQoE architecture leverages cross-attention between left and right image backbones, Siamese training with a hinge loss, and LoRA-finetuned DINOv2, showing superior alignment with human preferences over existing SIQA/IQA baselines and robust extrapolation to unseen distortions. The work also argues for VR-centric annotations, compares human preferences across viewing mediums, and demonstrates practical utility for evaluating mono-to-stereo generation methods in immersive environments.
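
The Siamese hinge-loss training mentioned above can be illustrated with a minimal sketch. It assumes that each 2AFC comparison yields one scalar score per stereo sample and that a lower score means better quality, as the Figure 2 caption states. The function names, the margin value, and the boolean label encoding are hypothetical choices made for illustration; they are not taken from the paper.

```python
def hinge_preference_loss(score_a, score_b, prefers_a, margin=1.0):
    """Pairwise hinge loss for one 2AFC comparison.

    Assumption: lower score = better quality, so the sample preferred by
    annotators should score lower than the other one by at least `margin`.
    The margin of 1.0 is an illustrative default, not a value from the paper.
    """
    sign = 1.0 if prefers_a else -1.0
    # gap is positive when the preferred sample has the lower score
    gap = sign * (score_b - score_a)
    return max(0.0, margin - gap)


def two_afc_accuracy(pairs):
    """Fraction of comparisons where the lower-scored sample is the human choice.

    `pairs` is an iterable of (score_a, score_b, prefers_a) tuples.
    """
    pairs = list(pairs)
    correct = sum(1 for a, b, prefers_a in pairs if (a < b) == prefers_a)
    return correct / len(pairs)
```

In this formulation the loss is zero once the preferred sample scores lower than the alternative by at least the margin, so training only pushes on pairs the model ranks incorrectly or too narrowly.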

Abstract

With rapid advancements in virtual reality (VR) headsets, effectively measuring stereoscopic quality of experience (SQoE) has become essential for delivering immersive and comfortable 3D experiences. However, most existing stereo metrics focus on isolated aspects of the viewing experience such as visual discomfort or image quality, and have traditionally faced data limitations. To address these gaps, we present SCOPE (Stereoscopic COntent Preference Evaluation), a new dataset composed of real and synthetic stereoscopic images featuring a wide range of common perceptual distortions and artifacts. The dataset is labeled with preference annotations collected on a VR headset, and our findings indicate a notable degree of consistency in user preferences across different headsets. Additionally, we present iSQoE, a new model for stereo quality of experience assessment trained on our dataset. We show that iSQoE aligns better with human preferences than existing methods when comparing mono-to-stereo conversion methods.

Paper Structure (23 sections, 12 figures, 6 tables)

Figures (12)

  • Figure 1: Dataset examples. Each stereo image was subjected to two different distortions, applied consistently to either the left, right, or both images. Participants in a VR-based user study were then asked to choose their preferred version. On the right of each sample, we zoom to highlight the differences between the images. Some distortions are more easily visible in 2D (e.g. Gaussian White Noise, Rotation) while others are more visible on VR devices (e.g. disparity differences cause increased depth sensation in the 2D lifting example).
  • Figure 2: Model architecture. The left and right images of a stereo pair are processed by a modified DINOv2 (Oquab et al., 2023) network with cross-attention between images. The resulting spatial tokens are pooled, concatenated, and passed through a small fully-connected network, outputting a single value indicating quality (lower is better). We train the model with a hinge loss and fine-tune the DINOv2 network with LoRA (Hu et al., 2021).
  • Figure 3: Test accuracy on SCOPE. We report the mean and standard deviation of accuracy on the unanimous cases in the test set over several splits. Our model outperforms the other SIQA and IQA models.
  • Figure 4: Progressive degradation evaluation. We report model scores on 200 stereo images for six different distortions. The gray regions indicate the distortion intensity used in SCOPE. Stereo images (a)-(d), presented as anaglyphs, exhibit progressive downscaling and are represented by stars.
  • Figure 5: Viewing medium comparison. We measure the correlation between human preferences across viewing mediums by calculating Cohen's kappa coefficient averaged across participants.
  • ...and 7 more figures
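
The Figure 5 caption relies on Cohen's kappa, which corrects the observed agreement between two sets of choices for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for a single pair of choice lists, for example one participant's 2AFC choices on two viewing mediums. The function name, the 0/1 label encoding, and the handling of the degenerate case are assumptions made for illustration; the averaging across participants described in the caption is not shown.

```python
from collections import Counter


def cohens_kappa(choices_a, choices_b):
    """Cohen's kappa between two equal-length sequences of categorical choices."""
    if len(choices_a) != len(choices_b) or len(choices_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(choices_a)

    # Observed agreement: fraction of items given the same label in both lists
    p_o = sum(x == y for x, y in zip(choices_a, choices_b)) / n

    # Chance agreement: product of the marginal label frequencies, summed over labels
    counts_a = Counter(choices_a)
    counts_b = Counter(choices_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Degenerate case: both lists use one identical label throughout.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical choices (0 = first version preferred, 1 = second version preferred)
headset = [0, 0, 1, 1]
other_medium = [0, 1, 0, 1]
kappa = cohens_kappa(headset, other_medium)
```

A kappa near 1 indicates that preferences agree far more often than chance would predict, a value near 0 indicates chance-level agreement, and negative values indicate systematic disagreement.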