An Empirical Evaluation of Neural and Neuro-symbolic Approaches to Real-time Multimodal Complex Event Detection

Liying Han, Mani B. Srivastava

TL;DR

Empirically, the neuro-symbolic architecture significantly surpasses purely neural models, demonstrating superior performance in CE recognition, even with extensive training data and ample temporal context for neural approaches.

Abstract

Robots and autonomous systems require an understanding of complex events (CEs) from sensor data to interact with their environments and humans effectively. Traditional end-to-end neural architectures, despite processing sensor data efficiently, struggle with long-duration events due to limited context sizes and reasoning capabilities. Recent advances in neuro-symbolic methods, which integrate neural and symbolic models leveraging human knowledge, promise improved performance with less data. This study addresses the gap in understanding these approaches' effectiveness in complex event detection (CED), especially in temporal reasoning. We investigate neural and neuro-symbolic architectures' performance in a multimodal CED task, analyzing IMU and acoustic data streams to recognize CE patterns. Our methodology includes (i) end-to-end neural architectures for direct CE detection from sensor embeddings, (ii) two-stage concept-based neural models mapping sensor embeddings to atomic events (AEs) before CE detection, and (iii) a neuro-symbolic approach using a symbolic finite-state machine for CE detection from AEs. Empirically, the neuro-symbolic architecture significantly surpasses purely neural models, demonstrating superior performance in CE recognition, even with extensive training data and ample temporal context for neural approaches.
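To make approach (iii) concrete, the following is a minimal sketch of a symbolic finite-state machine that consumes a stream of atomic-event (AE) labels and emits one complex-event (CE) label per time step. The pattern encoded here (eating after using the restroom without washing hands in between) is an illustrative assumption, as are the names `State`, `detect_ce`, and the AE strings; the paper's actual FSM definitions, AE vocabulary, and labeling rule are not reproduced in this section.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical FSM states for a single hygiene-style CE pattern."""
    CLEAN = auto()      # no pending restroom visit
    UNWASHED = auto()   # restroom used, hands not yet washed


def detect_ce(atomic_events):
    """Return one CE label (0 or 1) per atomic event, computed online.

    Assumed rule: label 1 is emitted at the step where "eat" occurs
    while the machine is in the UNWASHED state; every other step gets 0.
    """
    state = State.CLEAN
    labels = []
    for ae in atomic_events:
        label = 0
        if ae == "restroom":
            state = State.UNWASHED
        elif ae == "wash":
            state = State.CLEAN
        elif ae == "eat" and state is State.UNWASHED:
            label = 1
            state = State.CLEAN  # assumption: reset after the CE fires
        # Any other AE (e.g. "walk", "sit") leaves the state unchanged.
        labels.append(label)
    return labels


if __name__ == "__main__":
    print(detect_ce(["walk", "restroom", "walk", "eat"]))
    print(detect_ce(["restroom", "wash", "eat"]))
```

Because the machine keeps only its current state, it can track a pattern over an arbitrarily long gap between AEs, which is the property the abstract contrasts with the limited context size of end-to-end neural models.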

Paper Structure (32 sections, 8 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An illustration of the real-time complex event detection task. The example on the right shows that "Using Restroom" and "Eating" without "Washing hands" triggers the complex event detection, but the CE label "1" of this complex event is attached only at the last action, "Washing hands".
  • Figure 2: Daily activity simulator. Each Stage has a set of $n$ Activities that may happen according to a predefined distribution, where Activity $i$ has a probability $p_i$ of taking place in that Stage. Each Activity is defined by a temporal combination of relevant AEs. For example, in the Daytime Stage, the Activities "Walk-only", "Sit-only", "Restroom", "Work", and "Drink-only" happen with probabilities $[0.27, 0.27, 0.02, 0.4, 0.04]$ respectively. Each Activity is defined by the pattern displayed on the right side.
  • Figure 3: Overview of (Left) the entire real-time CED system, (Middle) the multimodal fusion module, and (Right) the complex event detector module.
  • Figure 4: Example CE label sequences predicted by NN models. Positive CE labels are highlighted in red in both ground-truth (label_i) and corresponding prediction (pred_i).
  • Figure 5: Evaluation of NN models with various CE training data sizes.
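
The per-Stage sampling described in the Figure 2 caption can be sketched as a weighted draw over Activities. The Activity names and probabilities below are taken from the caption's Daytime example; the function name `sample_activities`, the seeding scheme, and the assumption that successive draws are independent are hypothetical and may differ from the authors' simulator.

```python
import random

# Daytime Stage Activities and probabilities, as listed in the Figure 2 caption.
DAYTIME_ACTIVITIES = ["Walk-only", "Sit-only", "Restroom", "Work", "Drink-only"]
DAYTIME_PROBS = [0.27, 0.27, 0.02, 0.4, 0.04]


def sample_activities(k, seed=None):
    """Draw k Activities for the Daytime Stage, Activity i with probability p_i.

    Assumption: draws are independent and identically distributed.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    return rng.choices(DAYTIME_ACTIVITIES, weights=DAYTIME_PROBS, k=k)


if __name__ == "__main__":
    print(sample_activities(10, seed=0))
```

Each sampled Activity would then be expanded into its temporal sequence of AEs, a step that is omitted here because the AE patterns appear only in the figure itself.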