Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

Chunzhuo Wang, Zhewen Xue, T. Sunil Kumar, Guido Camps, Hans Hallez, Bart Vanrumste

TL;DR

This work tackles automated dietary monitoring by detecting food intake gestures from a skeleton-based representation, which preserves privacy and improves robustness. It introduces a dilated spatial-temporal graph convolutional network (ST-GCN) combined with a BiLSTM (ST-GCN-BiLSTM) that processes 23 upper-body keypoints to identify eating and drinking gestures in untrimmed video. On the OREBA dataset and a smartphone-based home dataset, the approach achieves high eating gesture F1-scores (e.g., 86.18% at k=0.1 on OREBA) and lower drinking gesture F1-scores (74.84% on OREBA, 67.80% on the smartphone dataset), with the best results obtained when all four body-part groups are combined. The results demonstrate cross-dataset validity and privacy advantages over RGB-based methods, while highlighting the need for improved drinking gesture detection and higher-quality keypoint extraction. The findings suggest that skeleton-based methods are viable for continuous dietary monitoring and motivate future multimodal extensions to further boost performance.

Abstract

Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life involves automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) framework, referred to as ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including environmental robustness, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures, respectively. Additionally, the model trained on OREBA was evaluated on a self-collected dataset of smartphone recordings made under less constrained experimental conditions, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures, respectively. The results not only confirm the feasibility of utilizing skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.
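
The segmental F1-scores reported above can be illustrated with a minimal sketch. It assumes that k is an intersection-over-union (IoU) threshold, as is common in segment-wise evaluation: a predicted segment counts as a true positive when its best-overlapping ground-truth segment of the same class has IoU of at least k and has not already been matched. The exact matching rules used in the paper are not stated in this section, and the function names and label encoding (0 = idle, 1 = eating, 2 = drinking) are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive


def extract_segments(labels: List[int], target: int) -> List[Segment]:
    """Collect maximal runs of consecutive frames labelled `target`."""
    segments: List[Segment] = []
    start = None
    for i, lab in enumerate(labels):
        if lab == target and start is None:
            start = i
        elif lab != target and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


def iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segmental_f1(pred: List[int], truth: List[int], target: int, k: float) -> float:
    """Segment-wise F1 for one class at IoU threshold k (assumed definition)."""
    pred_segs = extract_segments(pred, target)
    true_segs = extract_segments(truth, target)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for p in pred_segs:
        # Find the ground-truth segment that overlaps this prediction the most.
        best_iou, best_j = 0.0, -1
        for j, t in enumerate(true_segs):
            v = iou(p, t)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j >= 0 and best_iou >= k and not matched[best_j]:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)  # ground-truth segments never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy frame-level labels (illustrative only, not data from the paper).
truth = [0, 0, 1, 1, 1, 1, 0, 0, 2, 2, 2, 0]
pred = [0, 1, 1, 1, 1, 0, 0, 0, 0, 2, 0, 0]
eating_f1 = segmental_f1(pred, truth, target=1, k=0.1)
drinking_f1 = segmental_f1(pred, truth, target=2, k=0.1)
```

A low threshold such as k=0.1 rewards detecting that a gesture occurred even when its boundaries are imprecise; raising k makes the same predictions count as misses once the overlap falls below the threshold.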

Paper Structure

This paper contains 8 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Data collection scenes from (a) OREBA dataset (b) Smartphone footage dataset.
  • Figure 2: The original video frame alongside the skeleton heatmap reconstructed from preprocessed keypoint data. The heatmap is annotated with the x coordinate, y coordinate, and confidence score of two specific keypoints.
  • Figure 3: The design of the proposed ST-GCN-BiLSTM model, along with the specific configuration of each individual ST-GCN block.
  • Figure 4: An example of a drinking gesture is shown, where the model incorrectly classifies it as an eating gesture. This misclassification might be due to the change in hand position while placing the glass down.
  • Figure 5: The figure illustrates the skeletal representations of four distinct body parts: (a) face, (b) mouth, (c) arm, and (d) hands. Each subfigure highlights the skeletal structure specific to the corresponding body part. When multiple body parts are utilized by the model, suitable connections between them are incorporated.