Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance

Zixuan Wang, Yu Sun, Hongwei Wang, Baoyu Jing, Xiang Shen, Xin Dong, Zhuolin Hao, Hongyu Xiong, Yang Song

TL;DR

This work addresses the need for scalable, cross-issue content governance on short-video platforms by moving from per-issue classifiers to a unified multimodal language model. It introduces domain-adaptive pretraining that leverages three tasks—Caption, Visual Question Answering (VQA), and Chain-of-Thought (CoT)—to teach the model to perceive fine-grained video details, understand complex annotation guidelines, and perform structured reasoning; data for these tasks are generated by an annotator MLLM. Experiments with open-source MLLMs demonstrate significant zero-shot gains and improved data efficiency under supervised fine-tuning, with strong generalization to unseen issues like Shocking Graphic Content. The approach promises reduced labeling costs and shorter development cycles, enabling more robust and scalable governance in real-world deployments.

Abstract

Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM's perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM's understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM's reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of our domain-adaptive pretraining approach for short video content governance. For each issue type, we first decompose the annotation guidelines into a set of sub-questions to assist pretraining data generation. An annotator MLLM is deployed to produce three types of pretraining data: Caption, VQA, and CoT, enabling the model to mimic human-like reasoning. The model can be pretrained using three different strategies, and the pretrained model is finally evaluated in both zero-shot and SFT settings.
  • Figure 2: ROC-AUC of vanilla and pretrained models with different model sizes under zero-shot evaluation.
  • Figure 3: Left: An illustrative example of short video frames that violate the SSC policy. Right: Zero-shot evaluation prompt and the MLLM’s responses before and after our pretraining.
  • Figure 4: An illustrative example of prompts and the generated pretraining data. The first column is the prompt for generating the pretraining data. The second column is the output generated by the prompt in the first column. The third column is the prompt used during pretraining.
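
The data-generation flow described in the Figure 1 caption (guidelines decomposed into sub-questions, then an annotator MLLM producing Caption, VQA, and CoT data) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `PretrainSample`, `build_samples`, the `annotate` callable (a stand-in for a call to the annotator MLLM), and all prompt wordings are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PretrainSample:
    task: str    # one of "caption", "vqa", "cot"
    prompt: str  # text shown to the model during pretraining
    target: str  # text the model is trained to produce


def build_samples(
    video_id: str,
    sub_questions: List[str],
    annotate: Callable[[str, str], str],
) -> List[PretrainSample]:
    """Build one Caption, several VQA, and one CoT sample for a video.

    `annotate(video_id, prompt)` is a hypothetical stand-in for the
    annotator MLLM; it returns the generated text for that prompt.
    """
    samples: List[PretrainSample] = []

    # Caption: a detailed description of the video (prompt wording assumed).
    caption_prompt = "Describe this video in detail."
    samples.append(
        PretrainSample("caption", caption_prompt, annotate(video_id, caption_prompt))
    )

    # VQA: one sample per sub-question derived from the annotation guidelines.
    for question in sub_questions:
        samples.append(PretrainSample("vqa", question, annotate(video_id, question)))

    # CoT: a step-by-step rationale ending in a decision (prompt wording assumed).
    cot_prompt = "Reason step by step, then decide whether the video violates the policy."
    samples.append(PretrainSample("cot", cot_prompt, annotate(video_id, cot_prompt)))

    return samples
```

With a stub annotator that simply echoes the prompt, calling `build_samples("v1", ["q1", "q2"], stub)` returns four records in the order caption, vqa, vqa, cot. How the CoT rationale is actually produced, and how the three pretraining strategies mix these records, is not specified in this excerpt.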