Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring

Dan Lin, Philip Hann Yung Lee, Yiming Li, Ruoyu Wang, Kim-Hui Yap, Bingbing Li, You Shing Ngim

TL;DR

This work tackles Driver Action Recognition (DAR) in vehicle cabins with multi-modal sensing by introducing DFS, a dual feature shift framework. DFS exchanges features across modalities (modality feature interaction) and propagates features between neighbouring temporal frames within each modality (neighbour feature propagation), while sharing the encoders of the middle stages among modalities to learn cross-modality patterns efficiently. Evaluated on the Drive&Act dataset, DFS achieves state-of-the-art Top-1 and balanced accuracies (e.g., Top-1 ≈ $77.61\%$, Bal ≈ $63.12\%$) and demonstrates notable efficiency gains (latency ≈ $28.0$ ms, fewer parameters than TSM) with multi-modality inputs such as IR+Depth. The approach offers a practical, real-time solution for robust car-cabin DAR, using cross-modality fusion to handle partial visibility and variable lighting in real-world driving scenarios.

Abstract

Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring systems. In real-world applications, vehicle cabins are commonly equipped with cameras of different modalities. However, multi-modality fusion strategies for the DAR task within car cabins have rarely been studied. In this paper, we propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS. DFS first integrates complementary features across modalities by performing modality feature interaction. Meanwhile, DFS performs neighbour feature propagation within each modality by shifting features among temporal frames. To learn common patterns and improve model efficiency, DFS shares feature-extraction stages among multiple modalities. Extensive experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset. The results demonstrate that DFS achieves good performance and improves the efficiency of multi-modality driver action recognition.
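
To make the two shift operations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes two modalities (for example IR and depth) with feature maps of shape (T, C, H, W), a TSM-style temporal shift with zero padding at the clip boundaries, and a modality interaction that simply swaps a slice of channels between the two streams. The function names, the `fold_div` ratio, and the choice of which channels are shifted are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift part of the channels along time. x has shape (T, C, H, W).

    Assumption: TSM-style shift. The first C // fold_div channels move one
    frame forward, the next C // fold_div move one frame backward, and the
    vacated positions are zero-filled.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                  # frame t receives frame t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # frame t receives frame t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]             # remaining channels unchanged
    return out


def modality_shift(a, b, fold_div=8):
    """Swap the last C // fold_div channels between two modality streams.

    Assumption: a plain channel exchange stands in for the paper's
    modality feature interaction.
    """
    c = a.shape[1]
    fold = c // fold_div
    a_out, b_out = a.copy(), b.copy()
    a_out[:, c - fold:] = b[:, c - fold:]
    b_out[:, c - fold:] = a[:, c - fold:]
    return a_out, b_out


def dual_feature_shift(a, b, fold_div=8):
    """Apply the modality exchange, then the temporal shift, to both streams."""
    a, b = modality_shift(a, b, fold_div)
    return temporal_shift(a, fold_div), temporal_shift(b, fold_div)


# Usage: 4 frames, 16 channels, 2x2 spatial maps per modality.
ir = np.ones((4, 16, 2, 2))
depth = np.zeros((4, 16, 2, 2))
ir_out, depth_out = dual_feature_shift(ir, depth)
```

Both operations only move existing values between positions, so they add no learnable parameters. The sketch uses disjoint channel groups for the temporal and modality shifts so the two effects are easy to inspect separately; whether DFS does the same is not stated here.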

Paper Structure

This paper contains 16 sections, 3 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Sample frame sequences from different modalities for the action 'eating'. For each modality, drivers' actions are performed by the same individual with only a portion of the body visible and in unstable lighting conditions.
  • Figure 2: Framework of the proposed DFS model. DFS consists of five feature learning stages, followed by the fusion layer and fully connected (FC) layer. Between every two consecutive stages, the dual feature shift mechanism applies both modality and temporal feature interactions. In the middle stages (2 and 3), DFS shares weights among modalities to improve model efficiency.
  • Figure 3: Illustration of the shared feature encoders among different modalities. The weight $W$ is shared.
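
The weight sharing described in Figures 2 and 3 can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's architecture: each of the five stages is reduced to a single square weight matrix followed by ReLU, the modality names `ir` and `depth` are placeholders, and the fusion and FC layers are omitted. The only point is that stages 2 and 3 reuse one weight $W$ for every modality, which lowers the number of unique parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8                       # hypothetical feature width
SHARED_STAGES = {2, 3}        # middle stages share weights (per Figure 2)
MODALITIES = ("ir", "depth")  # placeholder modality names


def build_stage_weights(dim=DIM):
    """Return a dict mapping (modality, stage) to that stage's weight matrix."""
    weights = {}
    for stage in range(1, 6):
        if stage in SHARED_STAGES:
            w = rng.standard_normal((dim, dim)) * 0.1
            for m in MODALITIES:
                weights[(m, stage)] = w  # same array object for all modalities
        else:
            for m in MODALITIES:
                weights[(m, stage)] = rng.standard_normal((dim, dim)) * 0.1
    return weights


def encode(x, modality, weights):
    """Run one modality through the five stages (linear map + ReLU each)."""
    for stage in range(1, 6):
        x = np.maximum(x @ weights[(modality, stage)], 0.0)
    return x


def count_unique_parameters(weights):
    """Count parameters, counting each shared matrix only once."""
    unique = {id(w): w.size for w in weights.values()}
    return sum(unique.values())


weights = build_stage_weights()
features = encode(rng.standard_normal((4, DIM)), "ir", weights)
n_params = count_unique_parameters(weights)
```

In this toy setting, sharing two of five stages across two modalities gives 3 × 2 + 2 = 8 unique matrices instead of 10, which is the kind of saving the shared encoders are meant to provide.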