QPT V2: Masked Image Modeling Advances Visual Scoring

Qizhi Xie, Kun Yuan, Yunpeng Qu, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu

TL;DR

Visual scoring (VS) tasks suffer from limited labeled data. This paper presents QPT V2, a masked image modeling (MIM)-based pretraining framework tailored for image quality assessment (IQA), video quality assessment (VQA), and image aesthetics assessment (IAA). It rests on three components: curation of high-resolution (HR), high-foreground-coverage (HFC) pretraining data, quality- and aesthetics-aware degradations, and a multi-scale HiViT encoder. The approach achieves state-of-the-art results on 11 downstream benchmarks, demonstrating strong generalization and data efficiency across synthetic and real-world distortions. By unifying VS tasks under a single MIM paradigm, the work highlights the potential of incorporating human visual system priors into pretraining to improve perceptual quality assessment in real-world applications.

Abstract

Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification and detection). In this work, we take a novel perspective and investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive high-level semantics and fine-grained details, the pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, the model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms. Code and models will be released at https://github.com/KeiChiTse/QPT-V2.
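
The pretraining recipe outlined in the abstract (degrade an image, mask patches, reconstruct pixels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names, the Gaussian-noise degradation, the 0.75 mask ratio, the 8-pixel patch size, and the choice of the degraded image as the reconstruction target are all assumptions made for illustration, and the autoencoder output is replaced by a placeholder array.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened, non-overlapping patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, C)
    return x.reshape(-1, p * p * c)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def degrade(img, rng, sigma=0.05, prob=0.5):
    """Hypothetical stand-in for a degradation: Gaussian noise applied with probability `prob`."""
    if rng.random() < prob:
        img = np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
    return img

def masked_mse(pred, target, mask):
    """Pixel reconstruction loss: mean squared error over masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # toy image with values in [0, 1)
target = patchify(degrade(image, rng), p=8)      # 16 patches, 192 values each
mask = random_mask(len(target), mask_ratio=0.75, rng=rng)
pred = np.zeros_like(target)                     # placeholder for the autoencoder output
loss = masked_mse(pred, target, mask)
```

In a real setup, `pred` would come from an encoder that sees only the unmasked patches followed by a decoder that predicts the masked ones. The loss is restricted to masked patches, which is the usual choice in pixel-based MIM.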

Paper Structure (28 sections, 4 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: QPT V2: a new MIM-based pretraining paradigm for visual scoring. For pretraining, dataset $\mathcal{D}_I$ provides HR & HFC images, augmented by quality- and aesthetics-aware degradation $\mathcal{A}(\cdot)$. A multi-scale autoencoder $\mathcal{G}(\cdot)$ outputs the reconstructed images. Through finetuning of the encoder, it can solve visual scoring tasks like IQA, VQA, and IAA.
  • Figure 2: Semantics- and distortion-awareness of pixel-based MIM. (a) MIM is able to understand semantics; (b) pixel-based MIM can reconstruct the distortions applied to the original images. The left and right columns show the high- and low-frequency intervals, respectively.
  • Figure 3: Illustration of the gap in foreground coverage (FC) between SA-1B and DIV2K after cropping, shown (a) qualitatively and (b) quantitatively.
  • Figure 4: Overview of our proposed QPT V2. QPT V2 incorporates three improvements to pixel-based MIM, tailored for VS. To curate HR & HFC training data, we examine the resolution and foreground coverage of various datasets and samples. To determine the quality- and aesthetics-aware degradation, we explore the degradation type and composition. To perceive distortion and aesthetics information in a multi-scale fashion, we design a pretrain-only feature fusion module in a hierarchical encoder.
  • Figure 5: Illustration of the studied degradations, each transforms data stochastically.
  • ...and 1 more figures