MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

Yachun Mi, Yu Li, Weicheng Meng, Chaofeng Chen, Chen Hui, Shaohui Liu

TL;DR

MVQA introduces a Mamba-based, state-space-model approach to video quality assessment that achieves high efficiency by modeling long sequences with linear complexity. It couples a novel Unified Semantic and Distortion Sampling (USDS) strategy with a 3D-embedding MVQA architecture, enabling semantic preservation from downsampled frames alongside high-resolution distortion details via mask fusion. Empirical results across four datasets show MVQA-tiny matches or surpasses fast baselines with substantial speedups and memory savings, and MVQA-middle provides competitive or superior accuracy while remaining efficient. The work demonstrates a practical path for scalable, high-performance VQA on long videos by unifying sampling and sequence modeling through Mamba.
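The linear-complexity claim can be illustrated with a minimal sketch of a discrete linear state-space recurrence. This is not the paper's implementation: Mamba makes its parameters input-dependent and uses a hardware-aware scan, whereas the sketch below uses fixed, hypothetical toy matrices `A`, `B`, `C` and a plain Python loop. It only shows that each step touches a fixed-size state, so the cost grows linearly with sequence length.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A @ h_{t-1} + B * x_t, y_t = C @ h_t over a 1-D sequence.

    One constant-cost state update per element, so total work is O(len(x)).
    """
    state = np.zeros(A.shape[0])
    outputs = np.empty(len(x))
    for t, x_t in enumerate(x):
        state = A @ state + B * x_t   # fixed-size state update
        outputs[t] = C @ state        # readout for step t
    return outputs

# Hypothetical toy parameters, chosen only for illustration.
A = np.diag([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
y = ssm_scan(np.ones(4), A, B, C)
```

By contrast, self-attention compares every token with every other token, which is where the quadratic cost of Transformers on long sequences comes from.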

Abstract

The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, lightweight Convolutional Neural Networks (CNNs) and Transformers often struggle to balance efficiency with high performance because VQA requires long-range modeling capabilities. Recently, state-space models, particularly Mamba, have emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA depends heavily on resampling long sequences to minimize computational cost, yet current resampling methods are often weak at preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA, along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To avoid the extra computation that dual inputs would incur, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves performance comparable to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ of the GPU memory.
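The mask-based fusion described in the abstract can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' USDS: the checkerboard mask, the `grid` and `patch` sizes, the nearest-neighbour resize, and the one-random-crop-per-grid-cell distortion sampling are all hypothetical choices, since the abstract does not specify them. The point it shows is that the fused output has the same size as either single input, so the downstream model sees no extra tokens.

```python
import numpy as np

def resize_nearest(frame, size):
    """Downsample an (H, W, C) frame to (size, size, C) by nearest-neighbour indexing."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows][:, cols]

def sample_fragments(frame, grid, patch, rng):
    """Stitch one raw-resolution patch x patch crop from each cell of a grid x grid layout."""
    h, w = frame.shape[:2]
    cell_h, cell_w = h // grid, w // grid  # each cell must be at least patch pixels wide
    out = np.empty((grid * patch, grid * patch, frame.shape[2]), dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            top = i * cell_h + rng.integers(0, cell_h - patch + 1)
            left = j * cell_w + rng.integers(0, cell_w - patch + 1)
            out[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                frame[top:top + patch, left:left + patch]
    return out

def checkerboard_mask(grid, patch):
    """Assumed pre-defined mask: True selects the semantic patch, False the distortion patch."""
    cells = np.add.outer(np.arange(grid), np.arange(grid)) % 2 == 0
    pixels = np.repeat(np.repeat(cells, patch, axis=0), patch, axis=1)
    return pixels[..., None]  # trailing axis broadcasts over colour channels

def fuse(frame, grid=7, patch=32, seed=0):
    """Fuse semantic (resized) and distortion (raw-crop) patches into one input frame."""
    rng = np.random.default_rng(seed)
    size = grid * patch
    semantic = resize_nearest(frame, size)
    distortion = sample_fragments(frame, grid, patch, rng)
    return np.where(checkerboard_mask(grid, patch), semantic, distortion)

# Usage on a synthetic 64x96 RGB frame with a small 2x2 grid of 8x8 patches.
demo_frame = np.random.default_rng(1).integers(0, 256, size=(64, 96, 3), dtype=np.uint8)
fused = fuse(demo_frame, grid=2, patch=8)
```

Because the mask is fixed ahead of time, selecting between the two sources is a single element-wise choice per pixel and adds no learned parameters.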

Paper Structure

This paper contains 17 sections, 13 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Performance and efficiency comparison between FAST-VQA [29] and our MVQA-tiny. Our method achieves comparable results while running up to $2\times$ faster than FAST-VQA, currently the most efficient method, and reducing GPU memory usage by $5.3\times$. This advantage grows with the number of video frames and the batch size.
  • Figure 2: Comparison of the semantic information retained by USDS and by other sampling methods (resize, Fragments [29]).
  • Figure 3: The framework of our proposed method. (a) USDS consists of three distinct but interrelated stages: distortion detail extraction, semantic information retention, and fusion of all resolutions. (b) MVQA converts the input image blocks into one-dimensional vectors via 3D embedding, adds spatial and temporal position embeddings, extracts video features with a bidirectional Vision Mamba encoder [84], and finally predicts the video's quality score with a regression head.
  • Figure 4: Comparison of USDS with crop, resize, MRET [102], and Fragments [29] sampling.
  • Figure 5: Scan method.
  • ...and 2 more figures