Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search
Hengzhu Tang, Zefeng Zhang, Zhiping Li, Zhenyu Zhang, Xing Wu, Li Gao, Suqi Cheng, Dawei Yin
TL;DR
This work tackles the practical problem of video quality assessment in industrial video retrieval, where four distinct quality issues (visual artifacts, text misalignment, frame incoherence, and frame-text mismatches in AI-generated content) manifest in real deployments. It proposes the Multi-Branch Collaborative Learning Network (MBCN), which combines a multimodal encoder (text via Chinese BERT, frames via ViT with a temporal encoder) with four specialized branches (VTMAB, FCAB, FQAB, TQAB) whose outputs are adaptively fused using a squeeze-and-excitation mechanism. The model is trained with a joint pointwise and pairwise loss to yield stable, discriminative quality scores, and is validated both offline on a large production-derived dataset and online in Baidu's world-scale video search engine, showing significant gains in ranking metrics and AI-generated video detection. The results highlight the practical value of decomposing quality assessment into targeted, branch-specific evaluations and aggregating them dynamically to improve industrial VQA and retrieval performance. The model outputs a video-text quality score $f(v,t)$, trained against labels and soft targets, which can be integrated with other ranking signals in production systems.
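To make the joint pointwise and pairwise objective concrete, here is a minimal sketch in plain Python. The summary does not give the exact loss forms, so everything below is an assumption for illustration: binary cross-entropy against a (possibly soft) target as the pointwise term, a hinge ranking loss as the pairwise term, and the hypothetical parameters `alpha` and `margin` to balance and shape them.

```python
import math


def pointwise_loss(score, target):
    """Binary cross-entropy between a score in (0, 1) and a target in [0, 1].

    Assumed form: the target may be a hard label (0 or 1) or a soft target.
    """
    eps = 1e-7
    s = min(max(score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))


def pairwise_loss(score_hi, score_lo, margin=0.1):
    """Hinge ranking loss (assumed form).

    Zero once the higher-quality video leads the lower-quality one
    by at least `margin`; grows linearly otherwise.
    """
    return max(0.0, margin - (score_hi - score_lo))


def joint_loss(score_hi, target_hi, score_lo, target_lo, alpha=0.5, margin=0.1):
    """Pointwise terms for both videos plus a weighted pairwise term."""
    point = pointwise_loss(score_hi, target_hi) + pointwise_loss(score_lo, target_lo)
    pair = pairwise_loss(score_hi, score_lo, margin)
    return point + alpha * pair
```

In this sketch the pointwise term anchors each score to its own target, which keeps scores comparable across queries, while the pairwise term only penalizes pairs whose order is wrong or too close, which sharpens the ranking.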
Abstract
Video Quality Assessment (VQA) is vital for large-scale video retrieval systems, where it identifies quality issues so that high-quality videos can be prioritized. In industrial systems, low-quality video characteristics fall into four categories: visual issues such as mosaics and black boxes, textual issues arising from video titles and OCR content, and two semantic issues common in AI-generated videos, namely frame incoherence and frame-text mismatch. Despite their prevalence in industrial settings, these low-quality videos have been largely overlooked in academic research, which makes them difficult to identify accurately. To address this, we introduce the Multi-Branch Collaborative Network (MBCN), tailored for industrial video retrieval systems. MBCN features four branches, each designed to tackle one of the aforementioned quality issues. After each branch independently scores a video, we aggregate the scores using a weighted approach and a squeeze-and-excitation mechanism to dynamically address quality issues across different scenarios. We implement point-wise and pair-wise optimization objectives to ensure score stability and reasonableness. Extensive offline and online experiments on a world-level video search engine demonstrate MBCN's effectiveness in identifying video quality issues, significantly enhancing the retrieval system's ranking performance. Detailed experimental analyses confirm the positive contribution of all four evaluation branches. Furthermore, MBCN significantly improves recognition accuracy for low-quality AI-generated videos compared to the baseline.
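The aggregation step can be illustrated with a small sketch of squeeze-and-excitation style gating over the four branch scores. This is not the authors' implementation: the bottleneck shapes, the weight matrices `w_squeeze` and `w_excite`, and the normalization of the gates into weights that sum to one are all assumptions chosen to keep the example short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def se_fuse(branch_scores, w_squeeze, w_excite):
    """Fuse per-branch quality scores with SE-style gating (illustrative).

    branch_scores: one score per branch (four in MBCN).
    w_squeeze:     r x n matrix projecting the scores to a bottleneck of size r.
    w_excite:      n x r matrix projecting back to one gate per branch.
    Returns the fused score and the per-branch weights.
    """
    # Squeeze: compress the branch scores through a ReLU bottleneck.
    hidden = [max(0.0, h) for h in matvec(w_squeeze, branch_scores)]
    # Excite: one sigmoid gate in (0, 1) per branch.
    gates = [sigmoid(g) for g in matvec(w_excite, hidden)]
    # Assumption: normalize gates so the fused score stays in the score range.
    total = sum(gates)
    weights = [g / total for g in gates]
    fused = sum(w * s for w, s in zip(weights, branch_scores))
    return fused, weights


# Usage: with all-zero (untrained) weights every gate is 0.5,
# so the fusion reduces to a plain average of the branch scores.
scores = [0.8, 0.6, 0.4, 0.2]
w_squeeze = [[0.0] * 4 for _ in range(2)]
w_excite = [[0.0] * 2 for _ in range(4)]
fused, weights = se_fuse(scores, w_squeeze, w_excite)
```

Because the gates depend on the scores themselves, trained weights would let the fusion emphasize different branches for different videos, which is the dynamic behaviour the abstract describes.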
