KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal; Tanmay Talsaniya; Parag Patel; Meet Patel

Abstract

We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); its outputs are routed as text, not raw pixels, to Stage 2, which applies OCR and a text-based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. These results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
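The text-only figures above can be checked with a short worked computation. The sketch below is illustrative only: the abstract reports 44 images, 25 positives, and the recall and precision percentages, while the false-positive and true-negative counts used here are not stated in the source. They are inferred as the only integer counts consistent with the reported percentages (for example, 25 true positives at 75.76% precision implies 33 predicted positives, hence 8 false positives). The class name `Confusion` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts for one model on one subset."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def precision(self) -> float:
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    @property
    def recall(self) -> float:
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0

    @property
    def f1(self) -> float:
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


# Text-only subset: 44 images, 25 positives, so 19 negatives.
# ASSUMPTION: fp and tn are inferred from the reported percentages,
# not taken from the paper.
kidsnanny = Confusion(tp=25, fp=8, fn=0, tn=11)     # 100% recall, 75.76% precision
shieldgemma2 = Confusion(tp=21, fp=14, fn=4, tn=5)  # 84% recall, 60% precision

for name, c in [("KidsNanny", kidsnanny), ("ShieldGemma-2", shieldgemma2)]:
    print(f"{name}: n={c.total} "
          f"precision={c.precision:.2%} recall={c.recall:.2%} f1={c.f1:.2%}")
```

With only 19 negatives in the subset, a shift of a single false positive moves precision by roughly two to three percentage points, which is one concrete way to see why the small sample limits generalizability.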

Paper Structure (34 sections, 2 figures, 6 tables)

Introduction
Related Work
Image Safety Classification
Vision-Language Models for Content Safety
Online Child Safety and Grooming Detection
Multimodal Content Moderation
Methodology
System Architecture
Architecture Specification
Two-Stage Inference Strategy
OCR-Based Text Safety
Contextual Reasoning Module
Training Data
Model Selection Transparency
Evaluation
...and 19 more sections

Figures (2)

Figure 1: KidsNanny two-stage architecture. Stage 1 performs visual screening using a ViT classifier and object detector (11.7 ms). Stage 2 extracts text via OCR and passes extracted text + object labels (not raw pixels) to a 7B text-based language model for contextual reasoning (120 ms total pipeline).
Figure 2: Model performance visualizations. Diamond markers = multimodal models; circles = vision-only models. (a) KidsNanny (Stage 1+2) leads all models in accuracy (81.40%) and F1 (86.16%). (b) KidsNanny occupies the balanced deployment-viable zone; ShieldGemma-2 achieves the highest recall but at low precision (64.96%). (c) KidsNanny Stage 1 (11.7 ms) lies on the Pareto frontier; Stage 1+2 (120 ms) is the overall accuracy leader at 9-34× lower latency than competing VLMs.
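The 9-34× latency range in panel (c) can be reproduced from the latencies quoted in the abstract. The following is a minimal sketch, assuming the range is bounded by the two VLM baselines whose latencies the abstract reports (ShieldGemma-2 and LlavaGuard); the figure may include other models not listed here. The helper name `speedup` is hypothetical.

```python
# Latencies in milliseconds, as reported in the abstract.
PIPELINE_MS = 120.0  # KidsNanny Stage 1+2
BASELINES_MS = {
    "ShieldGemma-2": 1136.0,
    "LlavaGuard": 4138.0,
}


def speedup(baseline_ms: float, reference_ms: float) -> float:
    """Return how many times slower the baseline is than the reference."""
    if reference_ms <= 0:
        raise ValueError("reference latency must be positive")
    return baseline_ms / reference_ms


ratios = {name: speedup(ms, PIPELINE_MS) for name, ms in BASELINES_MS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the latency of the full KidsNanny pipeline")
```

The two ratios come out at roughly 9.5 and 34.5, which matches the caption's range when truncated to whole numbers.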
