Multimodal Guidance Network for Missing-Modality Inference in Content Moderation
Zhuokai Zhao, Harish Palani, Tianyi Liu, Lena Evans, Ruth Toner
TL;DR
This work tackles missing-modality inference in multimodal content moderation by introducing a guidance network that preserves multimodal training benefits while producing strong single-modality models for inference. The method uses cross-modality attention to re-weight image embeddings based on text-derived guidance, enabling knowledge transfer from text–image fusion to the image encoder without increasing inference latency. Empirical results on violence detection show substantial accuracy gains over CLIP and MobileOne baselines, with the best configuration reaching about 98% accuracy and sub-millisecond latency, demonstrating practical efficiency for large-scale moderation systems. The approach opens avenues for extending guidance with additional modalities and more sophisticated attention mechanisms to broaden applicability beyond two modalities and toward video understanding.
Abstract
Multimodal deep learning, especially vision-language models, has gained significant traction in recent years, greatly improving performance on many downstream tasks, including content moderation and violence detection. However, standard multimodal approaches often assume that the same modalities are available at training and inference, which limits their use in many real-world settings where some modalities may be missing at inference time. Existing research mitigates this problem by reconstructing the missing modalities, but doing so unavoidably adds computational cost, which can be just as critical a concern, especially for large deployed infrastructures in industry. To this end, we propose a novel guidance network that promotes knowledge sharing during training, taking advantage of multimodal representations to train better single-modality models for use at inference. Real-world experiments in violence detection show that our proposed framework trains single-modality models that significantly outperform traditionally trained counterparts, while avoiding any increase in computational cost at inference.
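As a rough illustration of the idea summarized above, the following sketch shows one way text-derived guidance could re-weight an image embedding during training while the inference path stays image-only. It is a minimal sketch under stated assumptions, not the architecture from this work: the element-wise gating form, the projection matrices `w_q` and `w_k`, the rescaling by the embedding dimension, and the function name `guide_image_embedding` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def guide_image_embedding(image_emb, text_emb, w_q, w_k):
    """Re-weight image embedding dimensions with text-derived guidance.

    Hypothetical formulation (an assumption, not taken from the source):
    the text embedding acts as a query, the image embedding as a key, and
    their element-wise product is normalized into per-dimension weights.
    """
    query = text_emb @ w_q            # (batch, dim), derived from text
    key = image_emb @ w_k             # (batch, dim), derived from image
    weights = softmax(query * key)    # (batch, dim), each row sums to 1
    dim = image_emb.shape[-1]
    # Multiply by dim so that uniform weights leave the embedding unchanged.
    return image_emb * weights * dim


rng = np.random.default_rng(0)
batch, dim = 4, 8
image_emb = rng.normal(size=(batch, dim))
text_emb = rng.normal(size=(batch, dim))
w_q = rng.normal(size=(dim, dim)) * 0.1
w_k = rng.normal(size=(dim, dim)) * 0.1

# Training-time path: text guides the image representation.
guided = guide_image_embedding(image_emb, text_emb, w_q, w_k)

# With no text signal the guidance is uniform and the embedding is unchanged.
unguided = guide_image_embedding(image_emb, np.zeros_like(text_emb), w_q, w_k)
```

In this sketch the guidance branch is only needed while training; at inference the image encoder's output would be passed to the classifier directly, which is consistent with the claim that no extra inference cost is incurred.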
