From Detection to Action Recognition: An Edge-Based Pipeline for Robot Human Perception

Petros Toupas, Georgios Tsamis, Dimitrios Giakoumis, Konstantinos Votis, Dimitrios Tzovaras

TL;DR

The paper proposes an end-to-end pipeline that spans human detection, tracking, and action recognition. It is designed to operate in near real-time with every stage of processing performed on the edge, reducing the need for centralised computation.

Abstract

Mobile service robots are proving increasingly effective in a range of applications, such as healthcare, monitoring Activities of Daily Living (ADL), and facilitating Ambient Assisted Living (AAL). These robots rely heavily on Human Action Recognition (HAR) to interpret human actions and intentions. However, for HAR to function effectively on service robots, it requires prior knowledge of human presence (human detection) and identification of the individuals to monitor (human tracking). In this work, we propose an end-to-end pipeline that encompasses the entire process, starting from human detection and tracking and leading to action recognition. The pipeline is designed to operate in near real-time while ensuring all stages of processing are performed on the edge, reducing the need for centralised computation. To identify the most suitable models for our mobile robot, we conducted a series of experiments comparing state-of-the-art solutions based on both their detection performance and efficiency. To evaluate the effectiveness of our proposed pipeline, we introduce a dataset comprising daily household activities. By presenting our findings and analysing the results, we demonstrate the efficacy of our approach in enabling mobile robots to understand and respond to human behaviour in real-world scenarios, relying mainly on data from their RGB cameras.

Paper Structure (11 sections, 1 equation, 6 figures, 2 tables, 1 algorithm)

This paper contains 11 sections, 1 equation, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: The proposed pipeline architecture deployed end-to-end on the mobile robotic platform, from human detection and tracking to action recognition.
  • Figure 2: The steps of the pipeline from the perspective of the robot: capturing the raw RGB image, detecting the skeletons of the people in the scene, projecting them to 3D space, assigning a tracking ID to each individual, and cropping the user's bounding box to provide as input to the HAR model.
  • Figure 3: Human skeletal keypoints
  • Figure 4: An overview of the proposed sliding window overlap methodology. The input RGB feed is displayed at the top of the figure, organised into time windows of equal duration ($t_{pw}$). The sliding windows ($t_{sw}$) span across multiple $t_{pw}$.
  • Figure 5: Comparison of state-of-the-art HAR models on both prediction accuracy and execution performance. Dot size in both graphs reflects each model's number of parameters, which ranges from 2.99 to 121 million: the larger the dot, the more parameters the model has.
  • ...and 1 more figure