BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok
Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick, Magdalayna Curry, Laura D'Adamo, Lindsay Young, Stuart B Murray, Kristina Lerman
TL;DR
BigTokDetect is a clinically informed, multimodal framework for detecting pro-bigorexia content on TikTok that fuses visual, audio, and textual signals. It introduces BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos, labeled by clinical psychiatrists across five categories and eighteen fine-grained subcategories, and uses it to evaluate vision-language models under zero-shot, few-shot, and fine-tuning paradigms. The study finds that multimodal fusion improves detection performance by 5 to 15 percent, with video cues contributing the most discriminative information. Commercial zero-shot models achieve the highest accuracy on the broad primary categories, while supervised fine-tuning enables smaller open-source models to perform better on fine-grained subcategory detection. Severity prediction remains challenging, but expert-guided annotation and modality integration provide a foundation for scalable, clinically informed content moderation: explicit harms are detected automatically and ambiguous content is flagged for human review. Overall, the work establishes a domain-adapted framework for moderating nuanced online harms in emerging mental health domains.
Abstract
Social media platforms face escalating challenges in detecting harmful content that promotes muscle dysmorphic behaviors and cognitions (bigorexia). This content can evade moderation by camouflaging as legitimate fitness advice and disproportionately affects adolescent males. We address this challenge with BigTokDetect, a clinically informed framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal benchmark dataset of over 2,200 TikTok videos labeled by clinical psychiatrists across five categories and eighteen fine-grained subcategories. Comprehensive evaluation of state-of-the-art vision-language models reveals that while commercial zero-shot models achieve the highest accuracy on broad primary categories, supervised fine-tuning enables smaller open-source models to perform better on fine-grained subcategory detection. Ablation studies show that multimodal fusion improves performance by 5 to 15 percent, with video features providing the most discriminative signals. These findings support a grounded moderation approach that automates detection of explicit harms while flagging ambiguous content for human review, and they establish a scalable framework for harm mitigation in emerging mental health domains.
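The moderation approach described above, automating detection of explicit harms while flagging ambiguous content for human review, can be illustrated with a minimal triage sketch. This is not the authors' implementation: the threshold values, the `Prediction` type, and the `route` function are hypothetical names introduced only for illustration, and the sketch assumes a classifier that outputs a single harm probability per video.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the source does not specify any values.
AUTO_FLAG_THRESHOLD = 0.90   # at or above this, treat the harm as explicit
REVIEW_THRESHOLD = 0.50      # between the two thresholds, treat as ambiguous


@dataclass
class Prediction:
    """Assumed classifier output for one video."""
    video_id: str
    harm_probability: float  # model-estimated probability of harm, in [0, 1]


def route(pred: Prediction) -> str:
    """Map a prediction to a moderation action based on model confidence."""
    if not 0.0 <= pred.harm_probability <= 1.0:
        raise ValueError("harm_probability must lie in [0, 1]")
    if pred.harm_probability >= AUTO_FLAG_THRESHOLD:
        return "auto_flag"       # explicit harm: handled automatically
    if pred.harm_probability >= REVIEW_THRESHOLD:
        return "human_review"    # ambiguous: escalate to a human moderator
    return "no_action"


if __name__ == "__main__":
    for p in (Prediction("v1", 0.97), Prediction("v2", 0.62), Prediction("v3", 0.08)):
        print(p.video_id, route(p))
```

In practice the two thresholds would have to be calibrated on held-out annotated data, trading off moderator workload against the risk of missed or wrongly flagged videos.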
