
ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang

TL;DR

This work proposes ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective and proposes an adaptive fine-grained event (AFE) representation to adaptively filter out the repeated events for the stationary objects while preserving dynamic ones.

Abstract

Event cameras have recently been shown to be beneficial for practical vision tasks, such as action recognition, thanks to their high temporal resolution, power efficiency, and reduced privacy concerns. However, current research is hindered by 1) the difficulty in processing events because of their prolonged duration and dynamic actions with complex and ambiguous semantics, and 2) the redundant action depiction of the event frame representation with fixed stacks. We find that language naturally conveys abundant semantic information, rendering it stunningly superior in reducing semantic uncertainty. In light of this, we propose ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective. Our ExACT brings two technical contributions. Firstly, we propose an adaptive fine-grained event (AFE) representation to adaptively filter out the repeated events for the stationary objects while preserving dynamic ones. This subtly enhances the performance of ExACT without extra computational cost. Then, we propose a conceptual reasoning-based uncertainty estimation module, which simulates the recognition process to enrich the semantic representation. In particular, conceptual reasoning builds the temporal relation based on the action semantics, and uncertainty estimation tackles the semantic uncertainty of actions based on the distributional representation. Experiments show that our ExACT achieves superior recognition accuracy of 94.83% (+2.23%), 90.10% (+37.47%), and 67.24% on the PAF, HARDVS, and our SeAct datasets, respectively.
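To make the idea of a distributional representation concrete, here is a minimal sketch, not the authors' implementation. It assumes each event or text embedding is modeled as a diagonal Gaussian (a mean vector plus a log-variance vector) and that cross-modal agreement is scored by averaging cosine similarity over sampled embeddings. The Gaussian form, the Monte Carlo scoring, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random
from typing import List, Sequence


def sample_embedding(mean: Sequence[float], log_var: Sequence[float],
                     rng: random.Random) -> List[float]:
    """Draw one sample z = mean + sigma * eps with eps ~ N(0, 1).

    Assumption: a diagonal Gaussian parameterized by mean and log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expected_similarity(event_mean: Sequence[float], event_log_var: Sequence[float],
                        text_mean: Sequence[float], text_log_var: Sequence[float],
                        n_samples: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean cosine similarity between
    samples drawn from the (hypothetical) event and text distributions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        event_z = sample_embedding(event_mean, event_log_var, rng)
        text_z = sample_embedding(text_mean, text_log_var, rng)
        total += cosine(event_z, text_z)
    return total / n_samples


# Same means, but one pair is confident (tiny variance) and one is uncertain.
confident = expected_similarity([1.0, 0.0], [-40.0, -40.0], [1.0, 0.0], [-40.0, -40.0])
uncertain = expected_similarity([1.0, 0.0], [2.0, 2.0], [1.0, 0.0], [2.0, 2.0])
```

In this sketch, a larger variance spreads the samples and lowers the expected similarity even when the means coincide, which is one simple way for an uncertain sub-action to contribute a weaker match than a confident one.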

Paper Structure (14 sections, 6 equations, 8 figures, 5 tables)

This paper contains 14 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Unlike stationary objects, e.g., ‘Dragonfly’, with short duration (0.1s) and limited semantics, dynamic actions like ‘Sit down’ have a prolonged duration (5s) with ambiguous and complex semantics. (b) Compared with previous event representations, which stack events with fixed counts, we adaptively filter out events recording stationary actions while preserving dynamic ones. (c) We introduce language guidance to simulate the recognition process, particularly focusing on conceptually reasoning temporal relations and estimating uncertain semantics.
  • Figure 2: Overall framework of our proposed ExACT. It consists of four components: (1) the AFE representation, which recursively eliminates repeated events and generates event frames depicting dynamic actions (Sec. \ref{sec:representation}); (2) the event encoder and (3) the text encoder, responsible for the event and text embeddings, respectively (Sec. \ref{sec:feature_embedding}); and (4) the CRUE module, which simulates the action recognition process to establish the complex semantic relations among sub-actions and reduce the semantic uncertainty (Sec. \ref{sec:CRUE}).
  • Figure 3: (a) Unlike existing methods, which often lead to repetitive event frames, our AFE representation adaptively filters out repetitive events for the same action based on the observed overlapped action regions; (b) Illustration of the AFE representation.
  • Figure 4: The proposed CRUE module consists of 1) conceptual reasoning for frame fusion based on the temporal relation among events and 2) uncertainty estimation of sub-actions for both text and event embeddings utilizing distributional representation.
  • Figure 5: Examples of our SeAct dataset.
  • ...and 3 more figures
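
The captions for Figures 1(b) and 3(a) describe filtering out repetitive events based on overlapped action regions. The snippet below is a simplified, hedged sketch of that idea, not the paper's AFE algorithm: it stacks events with a fixed count and then greedily keeps a frame only when its active-pixel region differs enough from the last kept frame. The paper describes its procedure as recursive; the greedy pass, the intersection-over-union overlap measure, the `max_overlap` threshold, and all names here are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp, polarity)
Frame = Set[Tuple[int, int]]         # set of active pixel coordinates


def stack_events(events: List[Event], count: int) -> List[Frame]:
    """Group consecutive events into fixed-count stacks.

    Each stack is reduced to the set of pixels that fired in it.
    """
    frames: List[Frame] = []
    for start in range(0, len(events), count):
        chunk = events[start:start + count]
        frames.append({(x, y) for x, y, _, _ in chunk})
    return frames


def overlap_ratio(a: Frame, b: Frame) -> float:
    """Intersection-over-union of two active-pixel sets."""
    union = a | b
    if not union:
        return 1.0  # two empty frames are treated as identical
    return len(a & b) / len(union)


def filter_repeated_frames(frames: List[Frame], max_overlap: float = 0.8) -> List[Frame]:
    """Keep a frame only if it overlaps the last kept frame by less than
    `max_overlap` (a hypothetical threshold, not taken from the paper)."""
    kept: List[Frame] = []
    for frame in frames:
        if not kept or overlap_ratio(kept[-1], frame) < max_overlap:
            kept.append(frame)
    return kept


# Three stacks from a stationary region followed by one stack of new motion.
events: List[Event] = [
    (0, 0, 0.00, 1), (1, 0, 0.01, 1),
    (0, 0, 0.02, 1), (1, 0, 0.03, 1),
    (0, 0, 0.04, 1), (1, 0, 0.05, 1),
    (5, 5, 0.06, 1), (6, 5, 0.07, 1),
]
frames = stack_events(events, count=2)
kept_frames = filter_repeated_frames(frames)
```

With this toy input, the three identical stacks from the stationary region collapse into one kept frame, and the stack covering a new region is preserved.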