Light-VQA+: A Video Quality Assessment Model for Exposure Correction with Vision-Language Guidance

Xunchu Zhou, Xiaohong Liu, Yunlong Dong, Tengchuan Kou, Yixuan Gao, Zicheng Zhang, Chunyi Li, Haoning Wu, Guangtao Zhai

TL;DR

This work targets the specialized problem of judging video quality after exposure correction in user-generated content. It introduces VEC-QA, a dataset combining LLVE-QA and OEVR-QA to cover low-light enhancement and over-exposure recovery, and presents Light-VQA+, a CLIP-guided VQA model that fuses spatial and temporal cues via cross-attention and applies Human Visual System (HVS)-inspired weighting to produce a final quality score. Light-VQA+ correlates more closely with human perception than current state-of-the-art VQA models on VEC-QA and other public datasets, and ablations verify the contributions of CLIP-based brightness/noise features, temporal brightness consistency, cross-attention fusion, and HVS weighting. The model also proves useful for improving exposure-correction algorithms, as shown by fine-tuning FEC-Net with Light-VQA+-guided supervision. Overall, Light-VQA+ offers a specialized, perceptually aligned metric for evaluating and developing Video Exposure Correction (VEC) algorithms.

Abstract

Recently, User-Generated Content (UGC) videos have become popular in daily life. However, UGC videos often suffer from poor exposure due to the limitations of photographic equipment and techniques. Video Exposure Correction (VEC) algorithms, including Low-Light Video Enhancement (LLVE) and Over-Exposed Video Recovery (OEVR), have therefore been proposed. Video Quality Assessment (VQA) is equally important to VEC. Unfortunately, almost all existing VQA models are general-purpose, measuring the quality of a video from a comprehensive perspective. To address this, Light-VQA, trained on LLVE-QA, was proposed for assessing LLVE. We extend Light-VQA by expanding the LLVE-QA dataset into the Video Exposure Correction Quality Assessment (VEC-QA) dataset, adding over-exposed videos and their corresponding corrected versions. In addition, we propose Light-VQA+, a VQA model specialized in assessing VEC. Light-VQA+ differs from Light-VQA mainly in its use of the CLIP model and vision-language guidance during feature extraction, followed by a new module inspired by the Human Visual System (HVS) for more accurate assessment. Extensive experimental results show that our model achieves the best performance against the current State-Of-The-Art (SOTA) VQA models on the VEC-QA dataset and other public datasets.

Paper Structure (25 sections, 19 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Representative frames of two original over-exposed videos and their corresponding recovered videos.
  • Figure 2: The scoring interface during the subjective experiment.
  • Figure 3: Framework of Light-VQA+. The model contains the spatial and temporal information extraction module via CLIP, the feature fusion module via cross-attention, and the quality regression module with HVS. Concretely, Spatial Information contains semantic, brightness, and noise features, while Temporal Information contains motion and brightness consistency features. [Key: SF: Semantic Features; BNF: Brightness & Noise Features; BCF: Brightness Consistency Features; MF: Motion Features]
  • Figure 4: The prompts utilized for extracting the brightness & noise information in Light-VQA+.
  • Figure 5: The structure for extracting the BNF and BCF from the third video clip. The "sub-videos" $V^{p}_{l,k}$ outlined in red correspond to the features that cover the third video clip. [Key: VF: Video Frames; BNF: Brightness & Noise Features; BCF: Brightness Consistency Features]
  • ...and 3 more figures