Multimodal Guidance Network for Missing-Modality Inference in Content Moderation

Zhuokai Zhao; Harish Palani; Tianyi Liu; Lena Evans; Ruth Toner

TL;DR

This work tackles missing-modality inference in multimodal content moderation by introducing a guidance network that preserves multimodal training benefits while producing strong single-modality models for inference. The method uses cross-modality attention to re-weight image embeddings based on text-derived guidance, enabling knowledge transfer from text–image fusion to the image encoder without increasing inference latency. Empirical results on violence detection show substantial accuracy gains over CLIP and MobileOne baselines, with the best configuration reaching about 98% accuracy and sub-millisecond latency, demonstrating practical efficiency for large-scale moderation systems. The approach opens avenues for extending guidance with additional modalities and more sophisticated attention mechanisms to broaden applicability beyond two modalities and toward video understanding.
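
The re-weighting idea can be illustrated with a small toy sketch: an attention map is computed from fused text-image features but applied to the image-only embeddings. This is not the authors' implementation. The function name `guided_image_embedding`, the additive fusion, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def guided_image_embedding(image_tokens, text_vector):
    """Toy text-guided re-weighting (hypothetical helper, not from the paper).

    image_tokens: list of n image token embeddings, each of length d
    text_vector:  one pooled text embedding of length d
    """
    d = len(text_vector)
    # ASSUMPTION: fuse by adding the pooled text vector to every image token.
    fused = [[x + t for x, t in zip(tok, text_vector)] for tok in image_tokens]
    # Self-attention map over the fused tokens (no learned Q/K projections here).
    attn = [softmax([dot(q, k) / math.sqrt(d) for k in fused]) for q in fused]
    # Apply the map to the image-only tokens rather than to the fused tokens.
    guided = [
        [sum(w * tok[j] for w, tok in zip(row, image_tokens)) for j in range(d)]
        for row in attn
    ]
    return attn, guided


# Placeholder embeddings, chosen only to make the example runnable.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_vector = [0.5, -0.5]
attn, guided = guided_image_embedding(image_tokens, text_vector)
```

Because the attention map is only needed while text is available, a setup along these lines lets the image branch be used alone at inference time, which is consistent with the claim that inference latency does not increase.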

Abstract

Multimodal deep learning, especially vision-language models, has gained significant traction in recent years, greatly improving performance on many downstream tasks, including content moderation and violence detection. However, standard multimodal approaches often assume consistent modalities between training and inference, which limits their use in many real-world settings where some modalities are unavailable at inference time. While existing research mitigates this problem by reconstructing the missing modalities, doing so unavoidably adds computational cost, which can be just as critical, especially for large deployed infrastructures in industry. To this end, we propose a novel guidance network that promotes knowledge sharing during training, taking advantage of multimodal representations to train better single-modality models for inference. Real-world experiments in violence detection show that our proposed framework trains single-modality models that significantly outperform traditionally trained counterparts, while avoiding increases in computational cost for inference.

Paper Structure (13 sections, 2 figures, 3 tables)

Introduction
Methodology
Text Embedding
Image Embedding and Image-Text Fusion
Text-Guided Image Embedding Re-Weighting
Experiment
Data Collection
Baseline Approaches
Contrastive Language Image Pretraining (CLIP)
MobileOne
Results of Our Guidance Approach
Discussions
Conclusions and Future Work

Figures (2)

Figure 1: An overview of the proposed guidance network in a vision-language setup. The network encodes text and image features separately, fuses the embeddings from both modalities, and then applies self-attention to obtain the attention map. However, instead of applying the attention map back to the fusion embeddings, we apply it to the image-only embeddings, promoting knowledge sharing from cross-modality features toward better single-modal attention.
Figure 2: Illustration of CLIP zero-shot classification. CLIP encodes the image and all candidate class captions, computes a similarity score between the image embedding and each text embedding, and takes the caption with the highest score as the classification result. In the case shown here, it classifies the input image as non-violent.
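
The zero-shot decision rule described in Figure 2 can be sketched in a few lines. The embeddings below are made-up placeholder vectors, not real CLIP outputs, and `zero_shot_classify` is a hypothetical helper name. Cosine similarity is assumed as the similarity score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def zero_shot_classify(image_emb, caption_embs):
    """Return the label whose caption embedding is most similar to the image.

    caption_embs maps each class label to its (placeholder) text embedding.
    """
    scores = {label: cosine(image_emb, emb) for label, emb in caption_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# ASSUMPTION: toy 3-d vectors standing in for encoder outputs.
image_emb = [0.2, 0.9, 0.1]
caption_embs = {
    "violent": [0.9, 0.1, 0.0],
    "non-violent": [0.1, 0.8, 0.2],
}
label, scores = zero_shot_classify(image_emb, caption_embs)
```

With these placeholder vectors the image is closer to the "non-violent" caption, mirroring the outcome shown in the figure.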
