Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling
Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu
TL;DR
Hi-Guard addresses three weaknesses of multimodal content moderation: misalignment with moderation policy, opaque decisions, and poor fine-grained classification. It uses a policy-aligned, two-stage cascade in which the second stage classifies content along a four-level hierarchical taxonomy (Domain → Topic → Subtype → Behavior). Decisions are grounded in explicit moderation rules supplied through structured prompts, accompanied by chain-of-thought reasoning, and optimized with Group Relative Policy Optimization (GRPO) using a multi-level soft-margin reward. The taxonomy enables path-based classification, which narrows the search space at each level and improves generalization, while the reward penalizes confusions between sibling categories and places more weight on deeper, policy-relevant distinctions. Results on offline datasets and in online deployment show improved accuracy, interpretability, and efficiency, along with a substantial reduction in human moderation workload. The approach offers a scalable, transparent framework for policy-aligned content safety on real-world platforms.
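The two-stage cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Decision`, `moderate`, `is_risky`, and `classify_path`, as well as the toy labels, are hypothetical, and the two models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A taxonomy path: (domain, topic, subtype, behavior).
Path = Tuple[str, str, str, str]


@dataclass
class Decision:
    risky: bool
    path: Optional[Path]  # None when stage 1 marks the content as safe


def moderate(item: str,
             is_risky: Callable[[str], bool],
             classify_path: Callable[[str], Path]) -> Decision:
    """Stage 1 filters safe content; stage 2 runs only on what remains."""
    if not is_risky(item):
        return Decision(risky=False, path=None)
    return Decision(risky=True, path=classify_path(item))


# Toy stand-ins for the lightweight filter and the stronger classifier
# (hypothetical rules and labels, for illustration only).
def toy_filter(text: str) -> bool:
    return "attack" in text


def toy_classifier(text: str) -> Path:
    return ("violence", "threats", "direct", "incitement")


print(moderate("a sunny day", toy_filter, toy_classifier))
print(moderate("plan an attack", toy_filter, toy_classifier))
```

Because the second-stage model is called only on content the filter flags, its cost scales with the share of risky content rather than with total traffic.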
Abstract
Social platforms have revolutionized information sharing but have also accelerated the spread of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must be not only efficient but also accurate and interpretable. However, current approaches rely largely on noisy, label-driven learning; they are not aligned with moderation rules, and they produce opaque decisions that hinder human review. We therefore propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two aspects of the system design: (1) a hierarchical moderation pipeline, in which a lightweight binary model first filters out safe content and a stronger model then handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, over which the model performs path-based classification from coarse to fine-grained levels. To stay aligned with evolving moderation policies, Hi-Guard incorporates rule definitions directly into the model prompt. To further improve structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.
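To make the reward idea concrete, the sketch below shows one way a level-weighted path reward could feed GRPO's group-relative advantage. The abstract does not give the reward formula, so the weights, the penalty value, and the function names (`path_reward`, `group_relative_advantages`) are assumptions chosen for illustration; only the advantage normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev
from typing import List, Sequence

# Illustrative values only; the actual weights and margins are not stated here.
LEVEL_WEIGHTS = (0.1, 0.2, 0.3, 0.4)  # deeper levels count more
SIBLING_PENALTY = 0.5                 # extra cost for choosing a wrong sibling


def path_reward(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Credit each correct level in order. At the first mismatch (a confusion
    between children of a correctly predicted parent), subtract a penalty
    scaled by that level's weight and stop crediting deeper levels."""
    reward = 0.0
    for weight, p, g in zip(LEVEL_WEIGHTS, pred, gold):
        if p != g:
            reward -= SIBLING_PENALTY * weight
            break
        reward += weight
    return reward


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical labels: three sampled predictions for one item.
gold = ("violence", "threats", "direct", "incitement")
samples = [
    gold,                                            # fully correct path
    ("violence", "threats", "direct", "glorifying"),  # wrong at Behavior
    ("spam", "ads", "links", "phishing"),             # wrong at Domain
]
rewards = [path_reward(s, gold) for s in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Under these assumed weights, an error at the deepest level costs more in absolute terms than an error at the top level, while a prediction that gets further down the correct path still earns a higher total reward.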
