A multidimensional measurement of photorealistic avatar quality of experience

Ross Cutler, Babak Naderi, Vishak Gopal, Dharmendar Palle

TL;DR

This work addresses the gap between objective metrics and human usability in photorealistic avatars by introducing an open-source crowdsourced framework that measures ten QoE dimensions (realism, trust, comfort using, comfort interacting with, appropriateness for work, creepiness, formality, affinity, resemblance to the person, and emotion accuracy) in telecommunication settings. It shows that standard objective metrics (PSNR, SSIM, LPIPS, FID, FVD) correlate only weakly with nine of these dimensions and moderately with emotion accuracy, so they largely fail to predict subjective judgments. The framework is validated against expert ratings and checked for reproducibility. For avatars above a certain level of realism, eight of the subjective dimensions are strongly correlated, enabling dimensionality reduction from ten to two components and suggesting templates for efficient evaluation. A key finding is the absence of an uncanny valley effect in telecommunication contexts: higher realism correlates positively with affinity and several usability factors. The framework and findings offer practical guidance for developing telecommunication systems with photorealistic avatars and highlight the need for subjective testing to drive perceptual quality improvements.

Abstract

Photorealistic avatars are human avatars that look, move, and talk like real people. The performance of photorealistic avatars has significantly improved recently based on objective metrics such as PSNR, SSIM, LPIPS, FID, and FVD. However, recent photorealistic avatar publications do not provide subjective tests of the avatars to measure human usability factors. We provide an open source test framework to subjectively measure photorealistic avatar performance in ten dimensions: realism, trust, comfortableness using, comfortableness interacting with, appropriateness for work, creepiness, formality, affinity, resemblance to the person, and emotion accuracy. Using telecommunication scenarios, we show that the correlation of nine of these subjective metrics with PSNR, SSIM, LPIPS, FID, and FVD is weak, and moderate for emotion accuracy. The crowdsourced subjective test framework is highly reproducible and accurate when compared to a panel of experts. We analyze a wide range of avatars from photorealistic to cartoon-like and show that some photorealistic avatars are approaching real video performance based on these dimensions. We also find that for avatars above a certain level of realism, eight of these measured dimensions are strongly correlated. This means that avatars that are not as realistic as real video will have lower trust, comfortableness using, comfortableness interacting with, appropriateness for work, formality, and affinity, and higher creepiness compared to real video. In addition, because there is a strong linear relationship between avatar affinity and realism, there is no uncanny valley effect for photorealistic avatars in the telecommunication scenario. We suggest several extensions of this test framework for future work and discuss design implications for telecommunication systems. The test framework is available at https://github.com/microsoft/P.910.
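The comparison between subjective and objective metrics described in the abstract can be illustrated with a minimal sketch. It assumes each avatar receives several crowd ratings on a 1-to-5 scale for one dimension, averages them into a mean opinion score (MOS), and computes the Pearson correlation between the per-avatar MOS and an objective metric. The avatar names, ratings, PSNR values, and the helper functions `mos` and `pearson` are all hypothetical and are not taken from the paper or from the P.910 repository.

```python
from math import sqrt
from statistics import mean


def mos(ratings):
    """Mean opinion score: the average of the individual ratings."""
    return mean(ratings)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data (NOT from the paper): realism ratings per avatar,
# assumed to be on a 1-5 scale, and a made-up PSNR value for each avatar.
ratings_realism = {
    "avatar_a": [4, 5, 4, 4],
    "avatar_b": [2, 3, 2, 3],
    "avatar_c": [3, 3, 4, 3],
}
psnr_db = {"avatar_a": 30.0, "avatar_b": 31.5, "avatar_c": 28.0}

# Align both metrics on the same avatar order before correlating.
models = sorted(ratings_realism)
mos_realism = [mos(ratings_realism[m]) for m in models]
psnr = [psnr_db[m] for m in models]

r = pearson(mos_realism, psnr)
print(f"Pearson correlation between realism MOS and PSNR: {r:.3f}")
```

A coefficient near zero would indicate that the objective metric says little about the subjective dimension. With real data, the same calculation would be repeated for each of the ten dimensions and each objective metric.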

Paper Structure

This paper contains 39 sections, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Data Flow Diagram.
  • Figure 2: The crowdsourcing test from the participant's perspective.
  • Figure 3: Items in the survey. (a) Template A as represented in the survey including a trapping and repeated items, (b) Template B focuses on resemblance to the person and emotion accuracy.
  • Figure 4: Avatars used for the survey.
  • Figure 5: (a) MOS scores per dimension across all models from Template A and (b) Template B.
  • ...and 5 more figures