MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment
Yachun Mi, Yu Li, Weicheng Meng, Chaofeng Chen, Chen Hui, Shaohui Liu
TL;DR
MVQA introduces a Mamba-based, state-space-model approach to video quality assessment that achieves high efficiency by modeling long sequences with linear complexity. It couples a novel Unified Semantic and Distortion Sampling (USDS) strategy with a 3D-embedding MVQA architecture, enabling semantic preservation from downsampled frames alongside high-resolution distortion details via mask fusion. Empirical results across four datasets show MVQA-tiny matches or surpasses fast baselines with substantial speedups and memory savings, and MVQA-middle provides competitive or superior accuracy while remaining efficient. The work demonstrates a practical path for scalable, high-performance VQA on long videos by unifying sampling and sequence modeling through Mamba.
Abstract
The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, lightweight convolutional neural networks (CNNs) and Transformers often struggle to balance efficiency with high performance because the task requires long-range modeling capabilities. Recently, state-space models, particularly Mamba, have emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA depends heavily on resampling long sequences to minimize computational cost, yet current resampling methods are often weak at preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA, along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To avoid the extra computation that dual inputs would incur, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieves performance comparable to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ of the GPU memory.
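
To make the mask-fusion idea concrete, the following is a minimal per-frame sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the mask pattern, grid size, patch size, or resampling filter, so the checkerboard mask, the 7x7 grid of 32x32 patches, the nearest-neighbour downsampling, and all function names below are hypothetical choices. The sketch only shows how a single fused input of fixed size can carry both a downsampled (semantic) view and raw-resolution (distortion) crops.

```python
import numpy as np


def downsample(frame, out_h, out_w):
    """Nearest-neighbour resize; a stand-in for whatever resampler is actually used."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]


def distortion_patches(frame, grid, patch, rng):
    """Crop one raw-resolution patch from each cell of a grid x grid partition
    of the frame and tile the crops into a (grid*patch, grid*patch) canvas."""
    h, w = frame.shape[:2]
    cell_h, cell_w = h // grid, w // grid
    if cell_h < patch or cell_w < patch:
        raise ValueError("frame too small for the requested grid and patch size")
    canvas = np.empty((grid * patch, grid * patch) + frame.shape[2:], dtype=frame.dtype)
    for i in range(grid):
        for j in range(grid):
            # Random offset inside the cell; pixels are copied without resizing.
            y = i * cell_h + rng.integers(0, cell_h - patch + 1)
            x = j * cell_w + rng.integers(0, cell_w - patch + 1)
            canvas[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
                frame[y:y + patch, x:x + patch]
    return canvas


def checkerboard_mask(grid, patch):
    """Hypothetical pre-defined mask: True cells take the semantic view."""
    cells = (np.add.outer(np.arange(grid), np.arange(grid)) % 2).astype(bool)
    return np.repeat(np.repeat(cells, patch, axis=0), patch, axis=1)


def usds_like_fusion(frame, grid=7, patch=32, seed=0):
    """Fuse a low-resolution semantic view and raw-resolution distortion crops
    into one input whose size matches a single sampled view."""
    rng = np.random.default_rng(seed)
    size = grid * patch
    semantic = downsample(frame, size, size)               # low-resolution view
    distortion = distortion_patches(frame, grid, patch, rng)  # raw-resolution crops
    mask = checkerboard_mask(grid, patch)
    if frame.ndim == 3:
        mask = mask[..., None]                             # broadcast over channels
    return np.where(mask, semantic, distortion)


if __name__ == "__main__":
    frame = np.random.default_rng(1).integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
    fused = usds_like_fusion(frame)
    print(fused.shape)
```

Because the fused canvas has the same spatial size as either view alone, the downstream model processes the same number of tokens as it would with a single sampling strategy, which is the property the abstract attributes to the mask-based fusion.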
