
Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model

Till Grutschus, Ola Karrar, Emir Esenov, Ekta Vats

TL;DR

This paper tackles human fall detection in untrimmed videos by leveraging a large video understanding foundation model (VideoMAEv2 ViT-B) instead of a bespoke architecture. It introduces a simple cutup-based temporal action localization pipeline that converts videos with timestamp annotations into labeled short clips for training, using Gaussian sampling and a priority labeling scheme for the classes Fall, Lying, and ADL. Gaussian sampling draws seeds $t_i \sim \mathcal{N}(t_{Fall}, \frac{1}{3}\min\{t_{Fall}, T - t_{Fall}\})$ around the fall midpoint $t_{Fall}$ and combines them with a clip length parameter $T_{clip}$. On HQFSD, the approach achieves a state-of-the-art video-level F1 score of $0.96$ under the given experimental settings, and the results are promising for real-time application; code and pretrained models will be released on GitHub.
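The Gaussian sampling step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the second parameter of $\mathcal{N}$ is a standard deviation, that $T$ is the video duration in seconds, that seeds falling outside the video are clamped to its bounds, and that each clip of length $T_{clip}$ is centred on its seed. The function names `sample_gaussian_seeds` and `seeds_to_clips` are hypothetical.

```python
import random


def sample_gaussian_seeds(t_fall, video_length, n_seeds, rng=None):
    """Draw seeds t_i ~ N(t_fall, sigma) with sigma = min(t_fall, T - t_fall) / 3.

    Assumption: sigma is a standard deviation, so roughly 99.7% of the
    seeds land inside the video before any clamping.
    """
    rng = rng or random.Random()
    sigma = min(t_fall, video_length - t_fall) / 3.0
    seeds = []
    for _ in range(n_seeds):
        t = rng.gauss(t_fall, sigma)
        # Assumption: clamp rare out-of-range draws to the video bounds.
        seeds.append(min(max(t, 0.0), video_length))
    return seeds


def seeds_to_clips(seeds, clip_length, video_length):
    """Centre a clip of clip_length seconds on each seed (assumed convention).

    Clips that would cross a video boundary are shifted back inside.
    """
    latest_start = max(video_length - clip_length, 0.0)
    clips = []
    for t in seeds:
        start = min(max(t - clip_length / 2.0, 0.0), latest_start)
        clips.append((start, start + clip_length))
    return clips


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for a reproducible illustration
    seeds = sample_gaussian_seeds(t_fall=30.0, video_length=100.0, n_seeds=5, rng=rng)
    for start, end in seeds_to_clips(seeds, clip_length=10.0, video_length=100.0):
        print(f"clip from {start:.2f}s to {end:.2f}s")
```

Tying the spread to the distance between the fall midpoint and the nearer video boundary concentrates clips around the fall while keeping almost all seeds inside the video.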

Abstract

This work explores the performance of a large video understanding foundation model on the downstream task of human fall detection on untrimmed video and leverages a pretrained vision transformer for multi-class action detection, with classes: "Fall", "Lying" and "Other/Activities of daily living (ADL)". A method for temporal action localization that relies on a simple cutup of untrimmed videos is demonstrated. The methodology includes a preprocessing pipeline that converts datasets with timestamp action annotations into labeled datasets of short action clips. Simple and effective clip-sampling strategies are introduced. The effectiveness of the proposed method has been empirically evaluated on the publicly available High-Quality Fall Simulation Dataset (HQFSD). The experimental results validate the performance of the proposed pipeline. The results are promising for real-time application, and the falls are detected on video level with a state-of-the-art 0.96 F1 score on the HQFSD dataset under the given experimental settings. The source code will be made available on GitHub.
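The conversion from timestamp annotations to labeled clips can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: annotations are taken to be `(label, start, end)` intervals in seconds, a clip receives a label if it overlaps that interval at all, and the priority order is assumed to be Fall, then Lying, with ADL as the fallback. The names `cutup`, `label_clip`, and `PRIORITY` are hypothetical.

```python
# Assumed priority order: the first label present in a clip wins.
PRIORITY = ("Fall", "Lying")
DEFAULT_LABEL = "ADL"


def cutup(video_length, clip_length, stride=None):
    """Cut a video into fixed-length clips; stride defaults to clip_length."""
    stride = stride or clip_length
    if video_length < clip_length:
        return []
    n_clips = int((video_length - clip_length) // stride) + 1
    # Index-based starts avoid accumulating floating-point drift.
    return [(i * stride, i * stride + clip_length) for i in range(n_clips)]


def label_clip(clip, annotations):
    """Return the highest-priority label whose interval overlaps the clip.

    annotations: list of (label, start, end) tuples (assumed format).
    """
    clip_start, clip_end = clip
    present = {
        label
        for label, start, end in annotations
        if start < clip_end and end > clip_start  # any temporal overlap
    }
    for label in PRIORITY:
        if label in present:
            return label
    return DEFAULT_LABEL


if __name__ == "__main__":
    annotations = [("Fall", 5.0, 6.0), ("Lying", 6.0, 9.0)]
    for clip in cutup(video_length=16.0, clip_length=4.0):
        print(clip, label_clip(clip, annotations))
```

With a priority rule, a clip that contains both the end of a fall and the start of lying is labeled Fall, so the rarest and most important class is not overwritten by its neighbours.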

Paper Structure (25 sections, 2 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Cutup and detect method pipeline. From the raw untrimmed footage annotated with action timestamps, clips are extracted and labeled with sampling and labeling strategies. The resulting dataset of clips is used in the downstream action recognition model.
  • Figure 2: Illustration of Gaussian sampling for a single video within the HQFSD. The red crosses mark the sampled seeds $t_i$. Best viewed in color.
  • Figure 3: Overview of the action recognition pipeline. The clips are preprocessed, and individual frames are sampled. A vision transformer backbone is used for feature extraction, and a fully connected head is used for classification.
  • Figure 4: Visualisation of results for selected frames from different clips.