Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search

Hengzhu Tang, Zefeng Zhang, Zhiping Li, Zhenyu Zhang, Xing Wu, Li Gao, Suqi Cheng, Dawei Yin

TL;DR

This work tackles video quality assessment in industrial video retrieval, where four distinct quality issues appear in real deployments: visual artifacts, textual problems in titles and OCR content, frame incoherence, and frame-text mismatch in AI-generated content. It proposes the Multi-Branch Collaborative Learning Network (MBCN), which combines a multimodal encoder (text via Chinese BERT, frames via ViT with a temporal encoder) with four specialized branches (VTMAB, FCAB, FQAB, TQAB) whose outputs are adaptively fused by a squeeze-and-excitation mechanism. The model is trained with a joint pointwise and pairwise loss to yield stable, discriminative quality scores. It is validated offline on a large production-derived dataset and online in Baidu's world-scale video search engine, where it shows significant gains in ranking metrics and in detecting low-quality AI-generated videos. The results highlight the practical value of decomposing quality assessment into targeted, branch-specific evaluations and aggregating them dynamically. The model's output $f(v,t)$ is the video-text quality score, trained against labels and soft targets, and it can be combined with other ranking signals in production systems.

Abstract

Video Quality Assessment (VQA) is vital for large-scale video retrieval systems, where it identifies quality issues so that high-quality videos can be prioritized. In industrial systems, low-quality video characteristics fall into four categories: visual-related issues like mosaics and black boxes, textual issues from video titles and OCR content, and semantic issues like frame incoherence and frame-text mismatch from AI-generated videos. Despite their prevalence in industrial settings, these low-quality videos have been largely overlooked in academic research, posing a challenge for accurate identification. To address this, we introduce the Multi-Branch Collaborative Learning Network (MBCN), tailored for industrial video retrieval systems. MBCN features four branches, each designed to tackle one of the aforementioned quality issues. After each branch independently scores videos, we aggregate these scores using a weighted approach and a squeeze-and-excitation mechanism to dynamically address quality issues across different scenarios. We implement point-wise and pair-wise optimization objectives to ensure score stability and reasonableness. Extensive offline and online experiments on a world-level video search engine demonstrate MBCN's effectiveness in identifying video quality issues, significantly enhancing the retrieval system's ranking performance. Detailed experimental analyses confirm the positive contribution of all four evaluation branches. Furthermore, MBCN significantly improves recognition accuracy for low-quality AI-generated videos compared to the baseline.
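
The aggregation and training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each of the four branches has already produced a scalar score in [0, 1], so the "squeeze" step is trivial and only the excitation (a two-layer bottleneck ending in a sigmoid) is modeled. The function names, the bottleneck width, the hinge form of the pairwise term, the margin, and the weighting factor `alpha` are all hypothetical choices made for illustration; the summary does not specify them.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def se_gates(scores: Sequence[float], w_down: Matrix, w_up: Matrix) -> List[float]:
    """Excitation step: branch scores -> ReLU bottleneck -> one sigmoid gate per branch.

    w_down has shape (hidden, branches); w_up has shape (branches, hidden).
    Both weight matrices are assumed to be learned elsewhere.
    """
    hidden = [max(0.0, sum(w * s for w, s in zip(row, scores))) for row in w_down]
    return [_sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_up]


def fuse_scores(scores: Sequence[float], w_down: Matrix, w_up: Matrix) -> float:
    """Gate-weighted average of the branch scores (assumed normalization)."""
    gates = se_gates(scores, w_down, w_up)
    return sum(g * s for g, s in zip(gates, scores)) / sum(gates)


def pointwise_loss(pred: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy; label may be a hard label or a soft target in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def pairwise_loss(pred_better: float, pred_worse: float, margin: float = 0.1) -> float:
    """Hinge loss asking the better video to outscore the worse one by a margin."""
    return max(0.0, margin - (pred_better - pred_worse))


def joint_loss(pred_better: float, pred_worse: float,
               label_better: float, label_worse: float,
               alpha: float = 0.5, margin: float = 0.1) -> float:
    """Pointwise terms for both videos plus an alpha-weighted pairwise term."""
    point = pointwise_loss(pred_better, label_better) + pointwise_loss(pred_worse, label_worse)
    return point + alpha * pairwise_loss(pred_better, pred_worse, margin)


if __name__ == "__main__":
    # Hypothetical scores from the four branches for one video.
    branch_scores = [0.9, 0.2, 0.6, 0.7]
    w_down = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, 0.0, -0.5]]
    w_up = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.5, 0.5]]
    print(fuse_scores(branch_scores, w_down, w_up))
```

Because the gates are positive and the fused score is normalized by their sum, the result always lies between the lowest and highest branch score. The pointwise term keeps individual scores calibrated against their labels, while the pairwise term enforces the relative ordering that matters for ranking.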

Paper Structure

This paper contains 30 sections, 17 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Traditional visual-related low-quality characteristics in industrial video retrieval systems.
  • Figure 2: Characteristics of AI-generated low-quality videos in industrial video retrieval systems.
  • Figure 3: Illustration of the Multi-Branch Collaborative Learning Network (MBCN). It takes the text and frame images of the video as inputs to obtain text, frame, and video representations, where a frame encoder and a temporal encoder are combined as the video encoder. Subsequently, four assessment branches are carefully designed to adapt to the four characteristics of low-quality videos in industrial video retrieval systems. Lastly, we perform a weighted aggregation of the various branches with a squeeze-and-excitation mechanism to dynamically address video quality issues in different scenarios.
  • Figure 4: Comparison of average model prediction scores under different labels.