Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos

Shreshth Saini, Bowen Chen, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik

TL;DR

This paper introduces HDR-Q, the first Multimodal Large Language Model for HDR-UGC VQA, together with a novel HDR-aware vision encoder that produces HDR-sensitive embeddings and an RL finetuning framework that anchors reasoning to HDR cues.

Abstract

High Dynamic Range (HDR) user-generated content (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR offers a higher bit depth, a wider color gamut, and an elevated luminance range, exposing distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate Beyond8Bits, a large-scale subjective dataset of 44K videos from 6.5K sources with over 1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO with an HDR-SDR contrastive KL term that encourages token reliance on HDR inputs and a Gaussian-weighted regression reward for fine-grained MOS calibration. Across Beyond8Bits and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance.
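
The abstract does not give the exact form of the Gaussian-weighted regression reward. As a rough illustration only, one plausible reading is a reward that peaks when the predicted MOS equals the ground-truth MOS and decays smoothly with the squared error. The function name `gaussian_mos_reward`, the `sigma` parameter, and its default value are assumptions made for this sketch, not the authors' implementation.

```python
import math


def gaussian_mos_reward(predicted_mos: float, target_mos: float, sigma: float = 5.0) -> float:
    """Hypothetical Gaussian-shaped reward for MOS regression.

    Returns a value in (0, 1] that equals 1.0 when the prediction matches
    the target and decays with the squared error. `sigma` (assumed here,
    not taken from the paper) controls how quickly the reward falls off.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    error = predicted_mos - target_mos
    return math.exp(-(error ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # A prediction close to the target earns more reward than a distant one.
    print(gaussian_mos_reward(72.0, 70.0))
    print(gaussian_mos_reward(85.0, 70.0))
```

Unlike a binary correct/incorrect reward, a smooth shape of this kind gives the policy a graded signal for near-miss predictions, which fits the stated goal of fine-grained MOS calibration.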

Paper Structure (31 sections, 11 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Overview of our dataset and performance evaluation. Top: Example comparisons between HDR and SDR frames, illustrating differences in brightness range, color depth, and visual detail across diverse scenes. Bottom-left: The distribution of video categories in the Beyond8Bits dataset, covering human-centered content, nature & outdoor scenes, and various other real-world scenarios. Bottom-right: Performance comparison between our proposed HDR-Q model and baseline methods on three datasets, where HDR-Q achieves significant improvements in PLCC.
  • Figure 2: Typical challenges in HDR-UGC videos, including HDR-specific issues (e.g., highlight clipping, color blooming, dark grain) and UGC/compression-related artifacts (e.g., color distortion, blocking, ringing, blurring).
  • Figure 3: Pipeline of Beyond8Bits construction.
  • Figure 4: Overview of HDR-Q with HAPO. Left: HAPO compares rollouts under HDR inputs (text + SDR + HDR tokens) versus an HDR-deprived pathway (text + SDR only), maximizing their KL divergence to enforce HDR grounding and applying dual-entropy regularization to prevent reward hacking. Group-wise rewards include MOS/attribute accuracy, reasoning quality, and self-rewarding. Right: a LoRA-tuned LLM decodes the HDR-aware reasoning; visual inputs originate from both a standard encoder and our HDR-aware adapter.
  • Figure 5: HDR-aware vision encoder finetuning. We adapt SigLIP-2 using HDR–SDR frame–caption pairs with captions generated by Qwen2.5-VL-72B, promoting perceptually aligned HDR embeddings.
  • ...and 2 more figures
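
The Figure 4 caption describes comparing rollouts with HDR tokens against an HDR-deprived pathway and maximizing the KL divergence between them. The sketch below shows one way such a quantity could be computed from per-token logits. The function names, the KL direction (HDR-conditioned distribution relative to the SDR-only one), and the averaging over tokens are all assumptions made for illustration; the page does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hdr_contrastive_kl(hdr_logits, sdr_only_logits):
    """Mean per-token KL between HDR-conditioned and SDR-only predictions.

    Each argument is a list with one logit vector per generated token.
    A larger value means the token distributions change more when the
    HDR tokens are present, i.e. the output relies more on HDR input.
    """
    if not hdr_logits or len(hdr_logits) != len(sdr_only_logits):
        raise ValueError("expected two non-empty sequences of equal length")
    per_token = [
        kl_divergence(softmax(h), softmax(s))
        for h, s in zip(hdr_logits, sdr_only_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    with_hdr = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]   # toy logits, HDR pathway
    without_hdr = [[0.4, 0.5, 0.2], [0.1, 1.4, 0.3]]  # toy logits, SDR-only pathway
    print(hdr_contrastive_kl(with_hdr, without_hdr))
```

In a training loop a term like this would be added to the objective with a positive weight so that it is maximized; the weighting and the interaction with the dual-entropy regularization mentioned in the caption are not covered here.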