Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

TL;DR

The paper tackles the difficulty of real-time moderation for live streams by proposing a hybrid framework that combines a supervised preset violation classifier with a reference-based similarity retrieval system. A knowledge-distillation pipeline from a frozen multimodal teacher (MLLM) guides lightweight student components for both the classifier and re-ranking stages, enabling low-latency deployment. Extensive offline and online evaluations on production-scale data show strong gains in coverage and precision, with substantial reductions in unwanted and duplicate streams. The results demonstrate that pairing high-precision known-violation detection with flexible, retrieval-based generalization yields scalable, adaptable content governance for rapidly evolving livestream platforms.

Abstract

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
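The abstract reports each pipeline as "recall at 80% precision". As a reading aid, the sketch below shows one common way such a figure can be computed from model scores and ground-truth labels: sweep the decision threshold and keep the highest recall among thresholds whose precision meets the target. The function name `recall_at_precision` and the toy data are hypothetical; the paper does not describe its exact evaluation procedure, so this is an assumption rather than the authors' method.

```python
from typing import List, Optional, Tuple


def recall_at_precision(
    scores: List[float],
    labels: List[int],
    target_precision: float = 0.80,
) -> Tuple[float, Optional[float]]:
    """Return (best recall, threshold) among thresholds meeting the precision target.

    Hypothetical helper: labels are 1 for a violation and 0 otherwise, and an
    item is flagged when its score is >= the threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank items from highest to lowest score, then lower the threshold step by step.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall, best_threshold = 0.0, None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Items with equal scores share a threshold, so evaluate only at the end of a tie group.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= target_precision and recall > best_recall:
            best_recall, best_threshold = recall, score
    return best_recall, best_threshold


# Toy data (made up for illustration, not from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 1]
toy_recall, toy_threshold = recall_at_precision(toy_scores, toy_labels, 0.80)
```

On this toy data, only the two highest thresholds keep precision at or above 0.80, so the function returns the recall reached at threshold 0.8. Read this way, "67% recall at 80% precision" means the operating point is fixed by the precision requirement and recall is whatever coverage remains.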

Paper Structure

This paper contains 22 sections, 6 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our proposed hybrid moderation framework. Incoming live-stream clips are segmented into 20-second windows and processed into multimodal inputs (frames, audio, text). The system follows two parallel paths: (1) Preset Violation Detection using a supervised multiclass classification model trained on labeled data, and (2) Reference Matching through a similarity-based retrieval system. Retrieved candidates are refined by a multimodal re-ranking model that evaluates semantic alignment across modalities. Enforcement decisions for livestreams are made using predefined decision rules.
  • Figure 2: Architecture of the supervised classification pipeline for preset violation detection. A LLaVA-One-Vision (Li et al., 2024) model, frozen after supervised fine-tuning (SFT), serves as the teacher. The student model learns to align with the teacher outputs via an MSE loss on the last hidden state and a KL divergence loss on the logits.
  • Figure 3: Training pipeline for the video-clip retrieval feature model. The visual encoder is trained using the MoCo (He et al., 2020) framework with a momentum encoder and memory bank. To enhance semantic richness, we incorporate CLIP (Radford et al., 2021) losses between visual-text and visual-audio embeddings for cross-modality alignment. Additionally, caption supervision is introduced via a second pass of the text decoder using cross-attention on visual features.
  • Figure 4: Small Re-ranking Model: a lightweight multimodal re-ranking model that integrates multimodal embeddings via bi-directional cross-attention layers. The resulting feature maps are then processed by a ResNet-MLP architecture (He et al., 2016) for scoring. This model is also knowledge-distilled from a fine-tuned LLaVA-One-Vision model, similar to the approach shown in Figure \ref{fig:rc_finegrain}.
  • Figure 5: Knowledge Distillation of Re-ranking Model: a frozen LLaVA-One-Vision (Li et al., 2024) teacher provides soft labels to guide the student re-ranking model. The student (Small Re-ranking Model) is trained with KL divergence and MSE losses to match the teacher predictions for efficient deployment.
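
The captions for Figures 2 and 5 describe a student trained to match a frozen teacher with an MSE loss on the last hidden state and a KL divergence loss on the logits. The sketch below illustrates how those two terms can be combined for a single example, in plain Python. The function names, the weights `hidden_weight` and `logit_weight`, the `temperature` parameter, and the KL direction (teacher to student) are all assumptions made for illustration; the captions do not specify them, and the sketch also assumes student and teacher hidden states already have the same dimensionality.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    """Numerically stable softmax with an optional temperature (assumed, not from the paper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_hidden: Sequence[float],
    teacher_hidden: Sequence[float],
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    hidden_weight: float = 1.0,   # hypothetical weighting, not given in the captions
    logit_weight: float = 1.0,    # hypothetical weighting, not given in the captions
    temperature: float = 1.0,     # hypothetical softening temperature
) -> float:
    """Weighted sum of hidden-state MSE and logit KL divergence for one example."""
    hidden_term = mse(student_hidden, teacher_hidden)
    # Assumed direction: the teacher distribution is the target, KL(teacher || student).
    logit_term = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return hidden_weight * hidden_term + logit_weight * logit_term


# Made-up toy vectors: a student that matches the teacher, and one that does not.
matched = distillation_loss([0.0, 1.0], [0.0, 1.0], [2.0, 0.5], [2.0, 0.5])
mismatched = distillation_loss([0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0])
```

The loss is zero when the student reproduces both the teacher's hidden state and its output distribution, and it grows as either drifts. In a real training loop the same two terms would be computed on batched tensors in a deep learning framework, typically with a projection layer when the student and teacher hidden sizes differ.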