Exploring Simple Siamese Network for High-Resolution Video Quality Assessment

Guotao Shen, Ziheng Yan, Xin Jin, Longhai Wu, Jie Chen, Ilhyun Cho, Cheul-Hee Hahm

TL;DR

This paper tackles high-resolution video quality assessment by arguing that technical quality must be interpreted in a semantic context. It proposes SiamVQA, a lightweight Siamese architecture that shares weights between technical and aesthetic branches and employs a dual cross-attention fusion to produce per-pixel quality maps that are then pooled for a final score. The model achieves state-of-the-art results on high-resolution benchmarks such as LSVQ$_{1080p}$, LIVE-Qualcomm, and YouTube-UGC, while remaining competitive on lower-resolution data, and does so with fewer parameters and faster runtime than several prior two-branch approaches. Overall, SiamVQA demonstrates that semantic-aware technical perception and effective multimodal fusion can significantly improve VQA performance without heavy model complexity.

Abstract

In video quality assessment (VQA) research, the two-branch network has emerged as a promising solution. It decouples VQA into separate technical and aesthetic branches, which measure the perception of low-level distortions and high-level semantics, respectively. However, we argue that while the technical and aesthetic perspectives are complementary, the technical perspective itself should be measured in a semantic-aware manner. We hypothesize that the existing technical branch struggles to perceive the semantics of high-resolution videos, as it is trained on local mini-patches sampled from videos. This issue can be hidden by apparently good results on low-resolution videos, but becomes critical for high-resolution VQA. This work introduces SiamVQA, a simple but effective Siamese network for high-resolution VQA. SiamVQA shares weights between the technical and aesthetic branches, enhancing the semantic perception ability of the technical branch to facilitate technical-quality representation learning. Furthermore, it integrates a dual cross-attention layer for fusing technical and aesthetic features. SiamVQA achieves state-of-the-art accuracy on high-resolution benchmarks and competitive results on lower-resolution benchmarks. Code will be available at: https://github.com/srcn-ivl/SiamVQA
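The two ideas named in the abstract, weight sharing between branches and dual cross-attention fusion, can be illustrated with a toy sketch in plain Python. Everything below is an assumption made for illustration: the single linear layer standing in for the backbone, the single-head attention, the residual fusion, the linear head, and all names (`shared_encoder`, `cross_attention`, `siamese_score`) are hypothetical and are not taken from the SiamVQA code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context."""
    dim = len(queries[0])
    context_t = [list(col) for col in zip(*context)]      # (dim, Nc)
    scores = matmul(queries, context_t)                   # (Nq, Nc)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, context)                       # (Nq, dim)

def shared_encoder(tokens, weight):
    """Assumed stand-in for the backbone: one linear map followed by ReLU."""
    return [[max(v, 0.0) for v in row] for row in matmul(tokens, weight)]

def siamese_score(fragment_tokens, frame_tokens, weight, head):
    # Siamese part: both inputs pass through the SAME weights.
    tech = shared_encoder(fragment_tokens, weight)   # technical branch
    aes = shared_encoder(frame_tokens, weight)       # aesthetic branch
    # Dual cross-attention (assumed form): each branch queries the other,
    # and the attended features are added back as a residual.
    tech_ctx = cross_attention(tech, aes)
    aes_ctx = cross_attention(aes, tech)
    tech_fused = [[t + c for t, c in zip(tr, cr)] for tr, cr in zip(tech, tech_ctx)]
    aes_fused = [[a + c for a, c in zip(ar, cr)] for ar, cr in zip(aes, aes_ctx)]
    # One quality value per token (a toy "quality map"), then mean pooling.
    quality_map = [sum(v * h for v, h in zip(row, head)) for row in tech_fused + aes_fused]
    return sum(quality_map) / len(quality_map)

rng = random.Random(0)
dim = 4
weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
head = [rng.uniform(-1, 1) for _ in range(dim)]
fragment_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
frame_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
score = siamese_score(fragment_tokens, frame_tokens, weight, head)
print(f"toy quality score: {score:.4f}")
```

Because `weight` is passed to both branches, any semantic ability the encoder learns from the down-sampled frames is also available when it processes the fragments, which is the intuition the abstract gives for sharing weights.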

Paper Structure

This paper contains 25 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the fragments used for the technical perspective and the down-sampled frames used for the aesthetic perspective. We observe that fragments sampled from high-resolution videos (e.g., 1080p) suffer from serious semantic degeneration, whereas fragments of lower-resolution videos preserve the semantics to a large extent. In this example, even a human can hardly tell the semantics of the original 1080p video purely from the sampled fragments.
  • Figure 2: Illustration of our argument that technical quality should be measured in context. From the first to the third row, we show videos of a snowy scene, a railway in a forest filmed by a camera moving forward fast, and a live show, respectively. Without considering semantics, many local patches in the first two videos are of low technical quality (low lighting, motion blur). With semantics, however, these local patches are largely natural, because heavy snow always leads to low lighting, and through the lens of a fast-moving camera, nearby objects are always more blurred than distant objects.
  • Figure 3: Architecture of our SiamVQA, compared with DOVER [wu2023exploring].
  • Figure 4: Visualization of branch-level and merged quality maps produced by DOVER [wu2023exploring] and our SiamVQA. For these two examples, SiamVQA gives higher technical scores to low-lighting and blurred regions, as these low-level distortions appear in a snowy scene and in a video captured by a fast-moving camera. The true quality scores of these two examples are 63.8 and 62.9; the predictions by DOVER are 49.11 and 38.11; the predictions by SiamVQA are 63.04 and 62.19.
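The semantic degeneration described for Figure 1 can be made concrete with a small sketch of grid-based fragment sampling. The grid size (7×7) and mini-patch size (32×32) are assumptions borrowed from common fragment-sampling setups and are not stated in the text above; the function names are hypothetical.

```python
import random

def sample_fragment_origins(height, width, grid=7, patch=32, seed=0):
    """Pick one mini-patch origin (top-left corner) per cell of a uniform grid."""
    rng = random.Random(seed)
    cell_h, cell_w = height // grid, width // grid
    origins = []
    for gy in range(grid):
        for gx in range(grid):
            # Random offset chosen so the patch stays inside its grid cell.
            y = gy * cell_h + rng.randint(0, max(cell_h - patch, 0))
            x = gx * cell_w + rng.randint(0, max(cell_w - patch, 0))
            origins.append((y, x))
    return origins

def fragment_coverage(height, width, grid=7, patch=32):
    """Fraction of the frame's pixels that survive in the stitched fragment."""
    return (grid * patch) ** 2 / (height * width)

resolutions = {"1080p": (1080, 1920), "540p": (540, 960), "360p": (360, 640)}
for name, (h, w) in resolutions.items():
    origins = sample_fragment_origins(h, w)
    print(f"{name}: {len(origins)} patches, {fragment_coverage(h, w):.1%} of pixels kept")
```

Under these assumed settings, the stitched fragment keeps about 2.4% of a 1080p frame but roughly 22% of a 360p frame, which is consistent with the caption's observation that fragments from high-resolution videos lose far more semantic content.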