A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)
Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Alexander Schindler
TL;DR
The paper addresses the fragmentation of video analysis by proposing a unified multimodal toolchain that fuses audio and visual cues. It introduces a modular architecture in which S2T, AED, ASC, VOD, IC, and VC run as Dockerized components, each exporting per-task JSON output that feeds clustering, comprehensive summaries, and riot-context detection. The work demonstrates the approach on the VCDB, DCASE 2021 Task 1B, and Crowded Scene datasets, illustrating audio/video clustering, descriptive summaries, and keyword-based riot detection. The framework can be extended to new applications and is intended for large-scale media analysis and safety-critical monitoring.
Abstract
In this paper, we present a toolchain for comprehensive audio/video analysis that leverages a deep learning based multimodal approach. To this end, the specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining the individual tasks and analyzing both the audio and visual data extracted from an input video, the toolchain offers several audio/video-based applications: two general applications, audio/video clustering and comprehensive audio/video summary, and one specific application, riot or violent context detection. Furthermore, the toolchain has a flexible and adaptable architecture that makes it straightforward to integrate new models for further audio/video-based applications.
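To make the data flow concrete, the following is a minimal sketch of how per-task JSON outputs could be merged and scanned for riot-related keywords. It is an illustration under stated assumptions, not the toolchain's actual implementation: the JSON schema (a single `labels` list per task), the keyword list, the `min_hits` threshold, and the function names `merge_task_outputs` and `detect_riot_context` are all hypothetical.

```python
import json
import re
from typing import Dict, List, Set, Tuple

# Hypothetical keyword list; the keywords used by the toolchain are not given here.
RIOT_KEYWORDS: Set[str] = {"riot", "fire", "explosion", "gunshot", "police", "crowd", "fight"}


def merge_task_outputs(task_json: Dict[str, str]) -> Dict[str, List[str]]:
    """Parse per-task JSON strings into {task name: [lowercased text items]}.

    Assumes each task (S2T, AED, ASC, VOD, IC, VC) exports an object with a
    "labels" list of strings. This schema is an assumption for illustration.
    """
    merged: Dict[str, List[str]] = {}
    for task, payload in task_json.items():
        data = json.loads(payload)
        merged[task] = [str(item).lower() for item in data.get("labels", [])]
    return merged


def detect_riot_context(
    merged: Dict[str, List[str]],
    keywords: Set[str] = RIOT_KEYWORDS,
    min_hits: int = 2,
) -> Tuple[bool, Dict[str, List[str]]]:
    """Flag a video when at least `min_hits` distinct keywords appear across tasks.

    Returns the decision and, per task, the sorted keywords that matched.
    """
    hits: Dict[str, List[str]] = {}
    for task, items in merged.items():
        # Split each item into whole words so that matching is word-level.
        words = {w for item in items for w in re.findall(r"[a-z]+", item)}
        matched = sorted(words & keywords)
        if matched:
            hits[task] = matched
    distinct = {kw for kws in hits.values() for kw in kws}
    return len(distinct) >= min_hits, hits


# Example with made-up outputs from three of the six tasks.
outputs = {
    "AED": '{"labels": ["Gunshot", "Siren"]}',
    "IC": '{"labels": ["a crowd of people standing near a fire"]}',
    "S2T": '{"labels": ["please stay calm"]}',
}
is_riot, evidence = detect_riot_context(merge_task_outputs(outputs))
print(is_riot, evidence)
```

Keeping each task's matches separate in `evidence` shows which modality contributed which cue, which is one plausible way to support a combined audio/visual decision; the threshold and the matching rule would need tuning on real data.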
