
UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment

Vlad Hosu, Lorenzo Agnolucci, Oliver Wiedemann, Daisuke Iso, Dietmar Saupe

TL;DR

The paper presents the UHD-IQA Benchmark Database, a large-scale no-reference IQA dataset of 6073 UHD-1 (4K) images annotated at a fixed width of 3840 pixels, with reliable expert crowdsourced ratings and rich metadata. It addresses the lack of high-resolution, high-quality IQA data by combining expert recruitment, reliability-focused crowdsourcing, and a careful sampling pipeline that sources images from Pixabay and filters out synthetic content. The images were annotated in 61 batches over two rounds to obtain 20 ratings per image, enabling robust mean opinion score (MOS) estimation and a reliable evaluation protocol. Benchmarking of multiple NR-IQA methods shows that self-supervised and CLIP-based approaches achieve strong rank correlations on UHD data but still fall short on absolute error metrics (RMSE/MAE), underscoring the need for UHD-specific training and models; the dataset thus provides a valuable resource for advancing practical, high-resolution NR-IQA and cross-resolution generalization.
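To make the rating aggregation and evaluation metrics concrete, below is a minimal Python sketch (not the authors' code): the ratings matrix and model predictions are randomly generated placeholders, and only the metric definitions (MOS as the per-image mean rating, SRCC, RMSE, MAE) follow standard practice.

    import numpy as np
    from scipy.stats import spearmanr

    # Placeholder data: 6073 images with 20 ratings each, as in the dataset.
    rng = np.random.default_rng(0)
    ratings = rng.uniform(0.2, 0.9, size=(6073, 20))
    mos = ratings.mean(axis=1)  # mean opinion score per image

    # Placeholder model output; a real NR-IQA model would score the images.
    pred = mos + rng.normal(0.0, 0.05, size=mos.shape)

    srcc = spearmanr(pred, mos).correlation            # rank correlation
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))  # absolute-scale error
    mae = float(np.mean(np.abs(pred - mos)))
    print(f"SRCC={srcc:.3f}  RMSE={rmse:.4f}  MAE={mae:.4f}")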

Abstract

We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. Importantly, the dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in 20 highly reliable ratings per image. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at https://database.mmsp-kn.de/uhd-iqa-benchmark-database.html
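As a usage illustration of the metadata mentioned in the abstract, the following sketch filters images by tag and popularity. The file name and column names (image_id, tags, likes, downloads) are assumptions made for illustration, not the dataset's documented schema; consult the dataset page linked above for the actual format.

    import pandas as pd

    # Hypothetical metadata file and columns (see dataset page for the
    # real schema).
    meta = pd.read_csv("uhd_iqa_metadata.csv")

    # Keep images tagged "landscape", then rank them by likes.
    landscape = meta[meta["tags"].str.contains("landscape", na=False)]
    top = landscape.sort_values("likes", ascending=False).head(50)
    print(top[["image_id", "likes", "downloads"]])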

Paper Structure

This paper contains 20 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Sample images from our dataset. The authors of the images are, from left to right, top to bottom: 'Bergadder', 'Daria-Yakovleva', 'Quangpraha', 'StockSnap', 'pcdazero', and 'Free-Photos' from Pixabay.com.
  • Figure 2: Resolution discrepancy of a $1024\times 768$px/0.7MP image (filled rectangle) as used in many IQA datasets vs. a $3840\times 2160$px UHD/8.3MP image (inner frame) that is common in our new dataset. The outer frame illustrates a $16320\times 12240$px/200MP image captured by recent smartphone sensors, pointing to future challenges.
  • Figure 3: Examples of synthetic images removed from the initial collection, leaving only authentic photos in the annotated dataset. The authors of the images are, from left to right: 'ColiN00B', 'stokpic', and 'jplenio' from Pixabay.com.
  • Figure 4: (a): Number of images in the training, validation, and test sets. (b): Distribution of the MOS in each subset.
  • Figure 5: (a): Root Mean Squared Difference (RMSD) between the MOS of different sized groups of participants on the test and validation sets. (b): SRCC between MOS of groups. The error bars show $\pm$1 standard deviation of the performance metrics, and the dots represent the averages. 200 samples of pairs of groups of non-overlapping participants were randomly drawn to compute the statistics.
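The inter-group reliability analysis described in the Figure 5 caption can be sketched as follows. This is an illustrative reimplementation, not the authors' code, under the assumption that per-image ratings are available as a (num_images, num_raters) matrix; the ratings themselves and the group sizes are placeholders.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    ratings = rng.uniform(0.2, 0.9, size=(1000, 20))  # placeholder matrix
    num_raters = ratings.shape[1]

    def group_agreement(group_size, num_samples=200):
        """RMSD and SRCC between MOS of two non-overlapping rater groups."""
        rmsd, srcc = [], []
        for _ in range(num_samples):
            perm = rng.permutation(num_raters)
            a, b = perm[:group_size], perm[group_size:2 * group_size]
            mos_a = ratings[:, a].mean(axis=1)  # MOS from group A only
            mos_b = ratings[:, b].mean(axis=1)  # MOS from group B only
            rmsd.append(np.sqrt(np.mean((mos_a - mos_b) ** 2)))
            srcc.append(spearmanr(mos_a, mos_b).correlation)
        return (np.mean(rmsd), np.std(rmsd)), (np.mean(srcc), np.std(srcc))

    for size in (1, 2, 5, 10):
        print(size, group_agreement(size))

As in the figure, agreement between independent groups improves with group size, which is the rationale for collecting 20 ratings per image.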