DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment

Li Yu, Situo Wang, Wei Zhou, Moncef Gabbouj

TL;DR

The paper tackles no-reference video quality assessment (NR-VQA) for in-the-wild content by introducing DVLTA-VQA, a CLIP-based framework that decouples CLIP's visual and textual components and places them at different stages of a pipeline modeled on the ventral and dorsal streams of the human visual system. It adds a Video-Based Temporal CLIP and a Temporal Context Module to model temporal dynamics, a Basic Visual Feature Extraction module for detail analysis, and a Text-Guided Adaptive Fusion that weights the multi-modal features dynamically under the guidance of textual semantics; quality is then predicted via prompt-based alignment. Empirical results on KoNViD-1k, LIVE-VQC, and YouTube-UGC show state-of-the-art SROCC/PLCC performance and strong cross-dataset generalization, with ablations confirming the critical roles of TAdaConv and the fusion strategy. The work suggests that brain-inspired decoupling and cross-modal fusion can markedly improve NR-VQA, with practical implications for streaming, monitoring, and quality-aware processing of user-generated video content.
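The two metrics named above can be illustrated with a minimal sketch. SROCC is the Spearman rank-order correlation (Pearson correlation computed on ranks), and PLCC is the Pearson linear correlation between predicted and subjective scores. The score lists below are hypothetical and are not taken from the paper; many VQA evaluations also fit a nonlinear mapping to the predictions before computing PLCC, a step this sketch omits.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical subjective scores (MOS) and model predictions.
mos = [1.2, 2.5, 3.1, 4.0, 4.6]
pred = [1.0, 2.0, 2.9, 4.4, 4.3]
print("SROCC:", spearman(mos, pred))
print("PLCC:", pearson(mos, pred))
```

SROCC rewards a model for ordering videos correctly regardless of scale, while PLCC also penalizes departures from a linear relationship, which is why the two are usually reported together.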

Abstract

Inspired by the dual-stream theory of the human visual system (HVS), in which the ventral stream is responsible for object recognition and detail analysis while the dorsal stream focuses on spatial relationships and motion perception, an increasing number of video quality assessment (VQA) methods built upon this framework have been proposed. Recent advances in large multi-modal models, notably Contrastive Language-Image Pretraining (CLIP), have motivated researchers to incorporate CLIP into dual-stream-based VQA methods. This integration aims to harness the model's superior semantic understanding capabilities to replicate the object recognition and detail analysis of the ventral stream, as well as the spatial relationship analysis of the dorsal stream. However, CLIP was originally designed for images and lacks the ability to capture the temporal and motion information inherent in videos. To address this limitation, this paper proposes Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment (DVLTA-VQA), which decouples CLIP's visual and textual components and integrates them into different stages of the no-reference VQA (NR-VQA) pipeline. Specifically, a Video-Based Temporal CLIP module is proposed to explicitly model temporal dynamics and enhance motion perception, aligning with the dorsal stream. Additionally, a Temporal Context Module is developed to refine inter-frame dependencies, further improving motion modeling. On the ventral stream side, a Basic Visual Feature Extraction Module is employed to strengthen detail analysis. Finally, a text-guided adaptive fusion strategy is proposed to enable dynamic weighting of features, facilitating more effective integration of spatial and temporal information.
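The idea of text-guided dynamic weighting can be sketched in a few lines. The abstract does not specify the fusion mechanism, so everything below is an assumption made for illustration: the function `text_guided_fusion`, the use of cosine similarity to a text feature as the gating signal, the softmax gate, and the `temperature` parameter are all hypothetical and should not be read as the authors' design.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_fusion(spatial, temporal, text, temperature=1.0):
    """Hypothetical gate: weight two feature vectors by their cosine
    similarity to a text feature, then return their weighted sum."""
    t = l2_normalize(text)
    feats = [l2_normalize(spatial), l2_normalize(temporal)]
    weights = softmax([dot(f, t) / temperature for f in feats])
    fused = [weights[0] * s + weights[1] * d for s, d in zip(*feats)]
    return fused, weights

# Toy 2-D features: the text feature points along the spatial feature,
# so the spatial stream receives the larger weight.
fused, weights = text_guided_fusion([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(weights)
print(fused)
```

In this toy setup the weights depend on the input rather than being fixed, which is the property the phrase "dynamic weighting" refers to; a learned module would replace the hand-written similarity gate.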

Paper Structure

This paper contains 22 sections, 15 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The proposed method consists of a Ventral Stream (blue), a Dorsal Stream (yellow), and Text-Guided Adaptive Fusion (green). The Ventral Stream extracts low-level visual features through the Basic Visual Feature Extraction Module, while the Dorsal Stream incorporates the Video-Based Temporal CLIP and Temporal Context Module for capturing high-level semantic features and fine-grained inter-frame temporal information. Features from both streams are then fused in the Text-Guided Adaptive Fusion block. The Text-Adapter integrates textual features to guide the fusion process, and the combined features are used for quality prediction.
  • Figure 2: Ablation results of different feature fusion strategies on the LIVE-VQC dataset.
  • Figure 3: Scatter plots of subjective quality scores versus predicted scores across different datasets. (a) LIVE-VQC. (b) KoNViD-1k. (c) YouTube-UGC.
  • Figure 4: This figure illustrates the differences in attention between coarse-grained and fine-grained temporal dynamic information. The top row shows original video frames. The second row shows heatmaps highlighting the model's focus on prominent motion, such as the moving shoreline, while the bottom row demonstrates the model's ability to capture finer details, including the continuously moving waves.