LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai

TL;DR

This paper introduces LMM-VQA, a large multimodal video quality assessment framework that couples a spatiotemporal visual encoder with modality-alignment projectors and a language model to predict video quality scores and levels. By reformulating VQA as a Q&A task trained with instruction prompts, LMM-VQA exploits both spatial and temporal cues through separate ViT-based spatial and SlowFast temporal encoders, whose features are aligned to the language space for LLM-based scoring. The approach achieves state-of-the-art performance on five blind VQA (BVQA) benchmarks and generalizes well to out-of-distribution datasets and general video understanding tasks, supporting the effectiveness of temporal-aware multimodal instruction tuning. The work highlights the value of explicit temporal modeling and prompt-driven alignment in LMMs for reliable perceptual quality assessment, with practical implications for QoE optimization in streaming applications.

Abstract

The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task because of diverse video content and complex spatial and temporal distortions, which calls for more advanced methods. Large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities on various visual understanding tasks, motivating us to leverage the multimodal representation ability of LMMs to solve the VQA task. We therefore propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem as a question-answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features that represent the quality characteristics of videos; these features are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquiry text tokens are aggregated as inputs to the large language model (LLM), which generates the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of $5\%$ in generalization ability over existing methods. Furthermore, owing to the design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness. Our code will be released at https://github.com/Sueqk/LMM-VQA.
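The Q&A reformulation described in the abstract can be illustrated with a minimal sketch that turns one subjective score into two instruction pairs, one for the quality score and one for the quality level. Everything specific here is an assumption made for illustration and is not taken from the paper: the five level names, the equal-width bins, the 0 to 100 score range, the `<video>` placeholder, the prompt wording, and the helper names `score_to_level` and `build_qa_pairs`.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical five-level scale; the source only says a "level" is generated.
LEVELS = ("bad", "poor", "fair", "good", "excellent")


@dataclass
class QAPair:
    question: str
    answer: str


def score_to_level(score: float, lo: float = 0.0, hi: float = 100.0) -> str:
    """Map a score to a level using equal-width bins (an assumed binning)."""
    if not lo <= score <= hi:
        raise ValueError("score outside the assumed range")
    width = (hi - lo) / len(LEVELS)
    # Clamp so that score == hi falls into the top bin.
    idx = min(int((score - lo) / width), len(LEVELS) - 1)
    return LEVELS[idx]


def build_qa_pairs(score: float, lo: float = 0.0, hi: float = 100.0) -> List[QAPair]:
    """Build one regression-style and one classification-style Q&A pair."""
    regression = QAPair(
        question=(
            "<video> How would you rate the quality of this video? "
            f"Answer with a number between {lo:g} and {hi:g}."
        ),
        answer=f"{score:.2f}",
    )
    classification = QAPair(
        question=(
            "<video> How would you rate the quality of this video? "
            f"Answer with one of: {', '.join(LEVELS)}."
        ),
        answer=score_to_level(score, lo, hi),
    )
    return [regression, classification]


if __name__ == "__main__":
    for pair in build_qa_pairs(62.5):
        print(pair.question, "->", pair.answer)
```

The point of the sketch is only that the regression target becomes the text of an answer, so a language model can be trained on it with ordinary instruction tuning.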

Paper Structure (29 sections, 10 equations, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: An overall performance comparison between LMM-VQA and existing state-of-the-art methods. KoNViD-1k$^\text{OOD}$ indicates that the model is trained on the LSVQ dataset and evaluated on the KoNViD-1k dataset. KoNViD-1k$^\text{FT}$ indicates that the model is pre-trained on the LSVQ dataset, fine-tuned on KoNViD-1k, and evaluated on the KoNViD-1k dataset. LSVQ$^\text{in-sample}$ indicates that the model is trained and evaluated on the LSVQ dataset. Metrics are (SRCC+PLCC)/2.
  • Figure 2: The construction of pairwise Q&A instruction prompts for training LMM-VQA. Given a video and its quality score, we construct pairwise Q&A instruction prompts for two tasks: quality score regression and quality classification. We use GPT-4 as an annotator to generate 2000 quality-prompt templates, which include system prompts, instruction prompts, and response restrictions.
  • Figure 3: Illustration of the video preprocessor. We slice the video into non-overlapping consecutive chunks of $\tau$ frames each. The first frame of the $j$-th chunk $\mathbf v^j$ is selected as the key frame $\mathbf y^j$.
  • Figure 4: Technical comparisons with other methods. Left: LLaVA takes only images as input; Middle: the core architecture of Video-LLaVA, which shares a unified V-L translator for images and videos because of the small modality gap between the two; Right: our proposed method, which uses two separate projectors for modality alignment to bridge the larger modality gap between text and video with temporal information.
  • Figure 5: The framework of LMM-VQA. LMM-VQA takes video frames as input and generates quality-score responses. The process begins with two vision encoders that transform the input frames into spatiotemporal visual features. These visual features are fed into two projection modules to generate visual tokens that are aligned with the text-related language space. Meanwhile, the text decoder produces scale tokens based on the quality prompt from users. Then, the text-guided spatial and temporal tokens and the quality prompt tokens are aggregated as input to the LLM, which generates the final answers.
  • ...and 1 more figures