Pose-guided multi-task video transformer for driver action recognition

Ricardo Pizarro, Roberto Valle, Luis Miguel Bergasa, José M. Buenaposada, Luis Baumela

TL;DR

This work tackles driver distraction recognition from in-car video by introducing PO-GUISE, a multi-task video transformer that jointly predicts driver actions and pose. By integrating pose heatmaps as learnable tokens and guiding token selection with both pose and class information, the approach reduces the number of spatio-temporal tokens, cutting GFLOPs while preserving or improving accuracy. The method achieves state-of-the-art results on driving datasets Drive&Act and 3MDAD, and maintains competitive performance on HMDB51 with significantly lower computational cost. The end-to-end pose-aware token pruning and merging enable more robust action recognition, particularly for object-interaction activities, and demonstrate the practical viability of efficient transformer-based driver monitoring systems.

Abstract

We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge, we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model's computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the-art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.
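To make the idea of pose- and class-guided token selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes each patch token already has one attention weight from the class token and one (averaged) attention weight from the heatmap tokens. The function name `select_tokens`, the mixing weight `alpha`, and the choice to fuse all non-kept tokens into a single score-weighted token are hypothetical and chosen only for illustration.

```python
def select_tokens(tokens, class_attn, pose_attn, keep, alpha=0.5):
    """Keep the `keep` highest-scoring patch tokens and fuse the rest into one.

    tokens:     list of N feature vectors (lists of floats), one per patch token
    class_attn: N attention weights from the class token to each patch token
    pose_attn:  N attention weights from the heatmap tokens (assumed pre-averaged)
    alpha:      hypothetical mixing weight between class and pose guidance
    Returns (selected_tokens, kept_indices).
    """
    if not (len(tokens) == len(class_attn) == len(pose_attn)):
        raise ValueError("tokens and attention vectors must have equal length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Combine the two guidance signals into a single importance score per token.
    scores = [alpha * c + (1.0 - alpha) * p for c, p in zip(class_attn, pose_attn)]

    # Rank tokens by score; keep the top ones in their original order.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])
    drop_idx = order[keep:]

    selected = [tokens[i] for i in kept_idx]

    # Fuse the remaining tokens into one token, weighted by their scores
    # (an illustrative merging rule, assumed here rather than taken from the paper).
    if drop_idx:
        total = sum(scores[i] for i in drop_idx)
        if total > 0:
            weights = [scores[i] / total for i in drop_idx]
        else:
            weights = [1.0 / len(drop_idx)] * len(drop_idx)
        dim = len(tokens[0])
        merged = [
            sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
            for d in range(dim)
        ]
        selected.append(merged)

    return selected, kept_idx
```

Calling `select_tokens(tokens, class_attn, pose_attn, keep=k)` returns at most `k + 1` tokens. Because self-attention cost grows quadratically with sequence length, shrinking the token set in this way is what would reduce the compute of the subsequent encoder layers.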

Paper Structure (23 sections, 1 equation, 10 figures, 8 tables, 1 algorithm)


Figures (10)

  • Figure 1: Our architecture consists of 4 stages. An input clip is tokenized and processed by a ViT encoder alongside learnable class and heatmap tokens. Within the encoder, green and blue blocks denote standard ViT layers and token selection modules respectively.
  • Figure 2: Per-class accuracy on fold 0 of the Drive&Act dataset. Bars show the performance of the baseline model, red 'X' marks that of the model augmented with heatmap data, and '+' symbols the results of PO-GUISE. For clearer presentation, we have grouped classes that are related to time-based activities, such as opening and closing a bottle, or putting on and taking off sunglasses.
  • Figure 3: Left: GFLOPs versus accuracy for different configurations, on fold 0 of Drive&Act. Right: GFLOPs versus accuracy for our proposed models and TransDARC (Peng et al., 2022), averaged over the three folds of Drive&Act.
  • Figure 4: Sample tokens from a frame processed by PO-GUISE at each of the three stages of the network, showing selected, discarded, and merged tokens. Red shows tokens that were discarded and blue those that were selected for merging in that stage.
  • Figure 5: Confusion matrix of the merged classes on fold 0 of the Drive&Act dataset for the baseline VideoMAEv2 model.
  • ...and 5 more figures