Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu

Abstract

This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment (IQA) benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted away from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the outputs of visual models.

Paper Structure (14 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Trend of full-reference IQA metrics (PSNR, SSIM, LPIPS). The percentages shown below each category represent the corresponding winning rates among those categories. "Best Category Mean" represents the mean value of the optimal model performance of each category.
  • Figure 2: Metrics such as PSNR, SSIM, and LPIPS often fail to accurately reflect perceptual image quality. Higher values indicate better performance for PSNR and SSIM, while lower values are preferred for LPIPS. The best result for each metric across different methods is highlighted in red. Zoom in for a closer view.
  • Figure 3: Trend of NR-IQA metrics (MUSIQ, MANIQA, CLIP-IQA). The percentages shown below each category represent the corresponding winning rates among those categories. "Best Category Mean" represents the mean value of the optimal model performance of each category.
  • Figure 4: Simple image manipulations can artificially boost NR-IQA metrics, highlighting how easily these metrics can be gamed.
  • Figure 5: User preference for model performance varies across different semantic scenarios.
  • ...and 2 more figures