Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

Jinfan Hu; Fanghua Yu; Zhiyuan You; Xiang Yin; Hongyu An; Xinqi Lin; Chao Dong; Jinjin Gu

Abstract

This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment (IQA) benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies the problem: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the outputs of visual models.

Paper Structure (14 sections, 7 figures, 2 tables)

Introduction
Our Position
Alternative Views
The Risks of Metric-Centric Research
IQA Metrics as De Facto Research Objectives
The Gap Between Metrics and Perceptual Quality
Exploitable Metrics and Inflated Scores
Human-Centric Evaluation as the Standard
Evolution of Human Evaluation Protocols
The Need for Multi-dimensional Human Evaluation
Metrics Still Matter—But Must Evolve
The Scale Gap between IQA and Restoration
Toward Semantic-Aware Image Quality Assessment
Conclusion

Figures (7)

Figure 1: Trend of full-reference IQA metrics (PSNR, SSIM, LPIPS). The percentages shown below each category represent the corresponding winning rates among those categories. "Best Category Mean" represents the mean value of the optimal model performance of each category.
Figure 2: Metrics such as PSNR, SSIM, and LPIPS often fail to accurately reflect perceptual image quality. Higher values indicate better performance for PSNR and SSIM, while lower values are preferred for LPIPS. The best result for each metric across different methods is highlighted in red. Zoom in for a closer look.
Figure 3: Trend of NR-IQA metrics (MUSIQ, MANIQA, CLIP-IQA). The percentages shown below each category represent the corresponding winning rates among those categories. "Best Category Mean" represents the mean value of the optimal model performance of each category.
Figure 4: Simple image manipulations can artificially boost NR-IQA metrics, showing how easily these metrics can be exploited.
Figure 5: User preference for model performance varies across different semantic scenarios.
...and 2 more figures
