An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM

Wen Wen, Yilin Wang, Neil Birkbeck, Balu Adsumilli

TL;DR

The paper tackles the generalization gap of blind video quality assessment (BVQA) on short-form content. A pretrained multimodal large language model (MLLM) produces a quality score $q_p$, which is fused with the prediction $q_l$ of a BVQA model through a learned ensemble weight $\alpha$ to give the ensemble score $q_e = \alpha q_p + (1-\alpha) q_l$. The paper systematically studies frame sampling, prompts, and robust inference strategies for the MLLM on Shorts-SDR and Shorts-HDR2SDR, finding that level-related prompts, cropping, and a high nucleus-sampling threshold improve correlations. The authors then introduce a content-aware ensemble that extracts SigLIP features from $M$ key frames (with $H_b\times W_b = 448^2$) and predicts $\alpha$ with a two-layer MLP; this ensemble generalizes better than the individual BVQA models on both short-form datasets. Analysis of the learned weights reveals content regimes where BVQA models struggle, which suggests directions for strengthening them. The approach offers a practical, modular path to robust BVQA for evolving short-form video content: $q_e$ is an adaptive fusion that draws on the MLLM's strengths on complex, edited short-form videos, and the strategy may generalize to other multimodal BVQA settings.
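The fusion rule $q_e = \alpha q_p + (1-\alpha) q_l$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mean pooling over key frames, the ReLU hidden layer, the sigmoid output, the feature and hidden sizes, and the function names `predict_alpha` and `ensemble_score` are all assumptions made for illustration, and random untrained weights stand in for a trained MLP and for real SigLIP features.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to keep alpha inside (0, 1) (assumption)."""
    return 1.0 / (1.0 + np.exp(-x))


def predict_alpha(frame_features, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping key-frame features to a weight alpha.

    frame_features: array of shape (M, D), one feature vector per key frame.
    w1: (D, H), b1: (H,), w2: (H,), b2: scalar.
    """
    pooled = frame_features.mean(axis=0)        # average over the M key frames (assumption)
    hidden = np.maximum(0.0, pooled @ w1 + b1)  # ReLU hidden layer (assumption)
    return float(sigmoid(hidden @ w2 + b2))     # scalar weight in (0, 1)


def ensemble_score(q_p, q_l, alpha):
    """q_e = alpha * q_p + (1 - alpha) * q_l, as stated in the summary."""
    return alpha * q_p + (1.0 - alpha) * q_l


# Toy usage with made-up sizes and scores; nothing here comes from the paper's data.
rng = np.random.default_rng(0)
num_frames, feat_dim, hidden_dim = 4, 8, 16
features = rng.normal(size=(num_frames, feat_dim))  # placeholder for SigLIP features
w1 = rng.normal(size=(feat_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=hidden_dim) * 0.1
b2 = 0.0

alpha = predict_alpha(features, w1, b1, w2, b2)
q_e = ensemble_score(q_p=3.8, q_l=3.1, alpha=alpha)
print(f"alpha={alpha:.3f}, q_e={q_e:.3f}")
```

Because $\alpha$ lies between 0 and 1, $q_e$ always falls between the two input scores: $\alpha$ near 1 trusts the MLLM, and $\alpha$ near 0 trusts the BVQA model.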

Abstract

The rise of short-form videos, characterized by diverse content, editing styles, and artifacts, poses substantial challenges for learning-based blind video quality assessment (BVQA) models. Multimodal large language models (MLLMs), renowned for their superior generalization capabilities, present a promising solution. This paper focuses on effectively leveraging a pretrained MLLM for short-form video quality assessment, examining the impact of pre-processing and response variability and offering insights on combining the MLLM with BVQA models. We first investigated how frame pre-processing and sampling techniques influence the MLLM's performance. Then, we introduced a lightweight learning-based ensemble method that adaptively integrates predictions from the MLLM and state-of-the-art BVQA models. Our results demonstrated superior generalization performance with the proposed ensemble approach. Furthermore, the analysis of content-aware ensemble weights highlighted that some video characteristics are not fully represented by existing BVQA models, revealing potential directions to improve BVQA models further.

Paper Structure

This paper contains 8 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The video in the green frame is from YouTube-UGC [wang2019youtube], while the video in the red frame is sourced from Shorts-SDR [wang2024youtube]. Despite sharing similar local artifacts, the short-form video contains more edited content and demonstrates a significantly higher mean opinion score (MOS). However, the learning-based model generates similar predictions for both videos, ultimately underestimating the quality of the short-form video.
  • Figure 2: The system diagram illustrates our MLLM evaluation methodology and the content-aware ensemble method, with only the gray part being tuned. In the MLLM evaluation process, videos are initially downsampled to key frames, which serve as input. These key frames are subsequently resized or cropped and subjected to multiple zero-shot prompting experiments employing various sampler techniques. The resulting numerous outputs are aggregated to produce a score, with the average serving as the MLLM's final prediction $q_p$. In the content-aware ensemble method, only the image features from the vision encoder are employed. A learned weight $\alpha$ is calculated to ensemble the predictions $q_p$ from the MLLM with the predictions $q_l$ from the learning-based models.
  • Figure 3: Distributions of standard deviations across varying numbers of zero-shot trials per frame. Level-related prompting, cropping pre-processing, and a nucleus sampler set at $0.9$ are utilized. The default trial number is $200$, involving $10$ crops with $20$ trials each.
  • Figure 4: Thumbnails of videos whose quality BVQA models tend to underestimate, while the MLLM provides superior predictions. (a) and (b) are from Shorts-SDR, while (c) and (d) are from Shorts-HDR2SDR.
  • Figure 5: Thumbnails of videos whose quality BVQA models tend to overestimate, while the MLLM provides superior predictions. (a) and (b) are from Shorts-SDR, while (c) and (d) are from Shorts-HDR2SDR.
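
The aggregation described in the captions of Figures 2 and 3 (many zero-shot trials per key frame, averaged into $q_p$, with the per-frame standard deviation measuring response variability) can be sketched as follows. This is a rough illustration under stated assumptions: each trial is taken to have already been converted to a numeric score, frames are weighted equally when averaging, and the function name `aggregate_trials` and the random toy scores are hypothetical.

```python
import numpy as np


def aggregate_trials(trial_scores):
    """Aggregate zero-shot trial scores into one MLLM prediction.

    trial_scores: array-like of shape (num_frames, num_crops, num_trials),
    holding one numeric score per trial (numeric scores are an assumption).
    Returns (q_p, frame_mean, frame_std).
    """
    scores = np.asarray(trial_scores, dtype=float)
    per_frame = scores.reshape(scores.shape[0], -1)  # pool crops x trials for each frame
    frame_mean = per_frame.mean(axis=1)              # average score per key frame
    frame_std = per_frame.std(axis=1)                # response variability per key frame
    q_p = float(frame_mean.mean())                   # equal-weight average over frames (assumption)
    return q_p, frame_mean, frame_std


# Toy usage: 3 key frames, 10 crops, 20 trials per crop (200 trials per frame,
# matching the default in the Figure 3 caption). The scores are random placeholders.
rng = np.random.default_rng(0)
toy_scores = rng.integers(1, 6, size=(3, 10, 20))
q_p, frame_mean, frame_std = aggregate_trials(toy_scores)
print(f"q_p={q_p:.3f}")
print("per-frame std:", np.round(frame_std, 3))
```

Lowering `num_trials` in the toy input and watching how much `q_p` changes between runs is one simple way to explore the trade-off between trial count and stability that Figure 3 examines.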