An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM
Wen Wen, Yilin Wang, Neil Birkbeck, Balu Adsumilli
TL;DR
The paper addresses the generalization gap of blind video quality assessment (BVQA) on short-form content. A pretrained multimodal large language model (MLLM) produces a quality score $q_p$, which is fused with a BVQA prediction $q_l$ through a learned weight $\alpha$ to give the ensemble score $q_e = \alpha q_p + (1-\alpha) q_l$. The authors systematically study frame sampling, prompts, and robust inference strategies for the MLLM on Shorts-SDR and Shorts-HDR2SDR, and find that level-related prompts, cropping, and high nucleus sampling improve correlations. They then introduce a content-aware ensemble that extracts SigLIP features from $M$ key frames (with $H_b \times W_b = 448 \times 448$) and predicts $\alpha$ with a two-layer MLP; this ensemble generalizes better than the individual BVQA models on both short-form datasets. Analysis of the learned weights reveals content regimes where BVQA models struggle, which suggests directions for strengthening them. The approach offers a practical, modular path to robust BVQA for evolving short-form content, and the fusion strategy may generalize to other multimodal BVQA settings.
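The fusion rule $q_e = \alpha q_p + (1-\alpha) q_l$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mean pooling over key frames, the ReLU and sigmoid activations, the layer sizes, and the random stand-in features are all assumptions made for illustration. Only the convex combination and the use of a two-layer MLP to predict $\alpha$ come from the summary above.

```python
import numpy as np

def predict_alpha(features, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping a feature vector to a weight in (0, 1)."""
    hidden = np.maximum(0.0, features @ w1 + b1)  # ReLU hidden layer (assumed)
    logit = hidden @ w2 + b2                      # scalar output
    return 1.0 / (1.0 + np.exp(-logit))           # sigmoid keeps alpha in (0, 1) (assumed)

def ensemble_score(q_p, q_l, alpha):
    """Convex combination of the MLLM score q_p and the BVQA score q_l."""
    return alpha * q_p + (1.0 - alpha) * q_l

rng = np.random.default_rng(0)
M, D, H = 4, 16, 8                         # key frames, feature dim, hidden width (illustrative)
frame_features = rng.normal(size=(M, D))   # random stand-in for per-frame SigLIP features
pooled = frame_features.mean(axis=0)       # mean pooling over frames (assumed)

# Randomly initialised weights; in practice these would be learned.
w1 = rng.normal(scale=0.1, size=(D, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=H)
b2 = 0.0

alpha = predict_alpha(pooled, w1, b1, w2, b2)
q_e = ensemble_score(q_p=3.8, q_l=3.2, alpha=alpha)  # made-up input scores
print(f"alpha={alpha:.3f}, q_e={q_e:.3f}")
```

Because $\alpha$ lies in $(0, 1)$, $q_e$ always falls between the two input scores; a larger $\alpha$ means the ensemble trusts the MLLM more for that particular video.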
Abstract
The rise of short-form videos, characterized by diverse content, editing styles, and artifacts, poses substantial challenges for learning-based blind video quality assessment (BVQA) models. Multimodal large language models (MLLMs), renowned for their superior generalization capabilities, present a promising solution. This paper focuses on effectively leveraging a pretrained MLLM for short-form video quality assessment, examining the impacts of pre-processing and response variability and offering insights on combining the MLLM with BVQA models. We first investigate how frame pre-processing and sampling techniques influence the MLLM's performance. We then introduce a lightweight learning-based ensemble method that adaptively integrates predictions from the MLLM and state-of-the-art BVQA models. Our results demonstrate superior generalization performance with the proposed ensemble approach. Furthermore, the analysis of content-aware ensemble weights shows that some video characteristics are not fully represented by existing BVQA models, revealing potential directions to improve BVQA models further.
