BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

Yiran Yang; Zhaowei Liu; Yuan Yuan; Yukun Song; Xiong Ma; Yinghao Song; Xiangji Zeng; Lu Sun; Yulu Wang; Hai Zhou; Shuai Cui; Zhaohan Gong; Jiefei Zhang


TL;DR

The paper addresses policy-driven moderation of multimodal short-video ads, where deceptive visuals, audio, and subtitles require precise, explainable checks grounded in fine-grained reasoning and policy guidance. It introduces BLM-Guard, a framework that combines Interleaved-modal Chain-of-Thought (ICoT) reasoning with rule-based policy priors and a self-adaptive GRPO reinforcement learning loop to align outputs with platform guidelines. A dedicated BLM-Guard Benchmark provides a three-level taxonomy (Severity, Scenario, Violation Type) and a data synthesis pipeline to support policy-grounded evaluation; the reported results show improvements in accuracy, consistency, and generalization over strong baselines. The approach targets practical moderation on short-video ad platforms by delivering explainable decisions and handling cross-modal mismatches and policy drift.

Abstract

Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.
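The composite reward mentioned in the abstract can be pictured as a weighted blend of two scores. The sketch below is illustrative only: the function name `composite_reward`, the mixing weight `alpha`, and the assumption that both scores lie in [0, 1] are hypothetical and are not taken from the paper.

```python
def composite_reward(coherence: float, adherence: float, alpha: float = 0.5) -> float:
    """Blend a causal-coherence score with a policy-adherence score.

    Assumptions (not from the paper): both scores lie in [0, 1], and
    alpha is a hypothetical mixing weight between the two terms.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    for name, score in (("coherence", coherence), ("adherence", adherence)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    # Convex combination: alpha weights coherence, (1 - alpha) weights adherence.
    return alpha * coherence + (1.0 - alpha) * adherence
```

A convex combination keeps the reward in the same range as its inputs, which makes it easy to compare across samples; the actual weighting used by BLM-Guard may differ.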


Paper Structure (35 sections, 12 equations, 3 figures, 2 tables, 1 algorithm)


Introduction
Our main contributions are:
Task Formulation and BLM-Guard Benchmark
Task Definition
BLM-Guard Benchmark
Data Construction
Methodology
Cold Start Strategy: Rule-Guided Causal Supervision
Keyframe and Region Extraction.
Interleaved Multi-stage CoT Generation.
Rule-Anchored Supervised Fine-tuning (SFT)
Reinforcement Learning: Principle-Guided Self-Consistent Optimization
(1) Rejection Sampling.
(2) Safety-Aware Concatenation.
Reward Design
...and 20 more sections

Figures (3)

Figure 1: BLM-Guard Benchmark Taxonomy. Our benchmark organizes commercial short-video ads into a hierarchical risk taxonomy with seven core violation scenarios and fine-grained subtypes. Each node reflects a policy-sensitive violation type (e.g., income exaggeration, privacy leak, feudal superstition), and is further associated with a severity level (high, medium, low). This structure supports interpretable supervision and enables fine-grained evaluation of model performance on diverse and nuanced moderation cases. A dedicated “No Risk” category is also included to balance risk distribution.
Figure 2: Our method adopts a progressive two-stage pipeline for policy-controllable content moderation. In Stage 1 (Rule-driven SFT Cold Start), we synthesize structured visual-language Chain-of-Thought (ICoT) data via keyframe selection and multi-step prompting using InternVL. This enables supervised fine-tuning with rule-anchored causal supervision. In Stage 2 (Self-adaptive GRPO Reinforcement Learning), we apply a safety-aware data curation strategy and propose a Self-Adaptive Critique Reward (SACR) to dynamically evaluate reasoning outputs. The model is optimized using a modified Group Relative Policy Optimization (GRPO) algorithm with token-level normalization and dynamic sampling. This multi-stage process enables the model to first learn fine-grained compliance reasoning and then refine its moderation behavior through adaptive, reward-driven training.
Figure 3: Comprehensive comparison of model strategies and components. Left: Radar plots highlight improvements in accuracy, precision, and consistency across key metrics. Middle: Heatmaps visualize performance across fine-grained risk scenarios and external benchmarks. Right: Histograms show the distributional gains in consistency induced by Rule-SFT and SACR reward learning.
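The Stage 2 optimization named in the Figure 2 caption (GRPO with token-level normalization and dynamic sampling) can be sketched roughly as follows. This is one common reading of those terms, not the authors' implementation: the function names, the rule of dropping groups with identical rewards, and the choice to average over all tokens in a group are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards, tol=1e-8):
    """Dynamic sampling (as assumed here): skip groups whose rewards are all
    equal, because every advantage would be zero and give no learning signal."""
    return max(rewards) - min(rewards) > tol


def token_level_objective(token_terms, advantages):
    """Token-level normalization (as assumed here): weight each token's
    surrogate term by its response's advantage, then average over all tokens
    in the group rather than averaging per response first."""
    total = 0.0
    count = 0
    for terms, adv in zip(token_terms, advantages):
        total += adv * sum(terms)
        count += len(terms)
    return total / count if count else 0.0


# Hypothetical group of three sampled responses with scalar rewards.
rewards = [1.0, 0.0, 0.5]
if keep_group(rewards):
    adv = group_advantages(rewards)
    # Per-token surrogate terms (e.g. probability ratios); values are made up.
    objective = token_level_objective([[1.0, 1.0], [1.0], [1.0, 1.0, 1.0]], adv)
```

Averaging over tokens rather than responses means long and short reasoning chains contribute in proportion to their length; whether BLM-Guard uses exactly this normalization is not stated in the material above.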
