TadML: A fast temporal action detection with Mechanics-MLP

Bowen Deng, Dongchang Liu

TL;DR

TadML tackles the inefficiency of RGB+flow pipelines in temporal action detection by offering an RGB-only, one-stage anchor-free approach. It introduces Mechanics-MLP, a Newtonian mechanics-inspired MLP that performs token mixing and multi-scale temporal feature fusion via a Time Fusion Pyramid Network (TFPN), yielding competitive accuracy and substantially faster inference than prior methods. A beta-GIoU loss is proposed to improve time-boundary regression. On THUMOS14 and ActivityNet1.3, TadML achieves state-of-the-art or near-state-of-the-art results with up to 4.44 videos per second on THUMOS14, illustrating the practical advantage of RGB-only, anchor-free design.
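
The summary above mentions a beta-GIoU loss for time-boundary regression but does not give its form. As background only, the following is a minimal sketch of the standard generalized IoU for one-dimensional temporal segments, on which such a loss would presumably build. The function names `giou_1d` and `giou_loss_1d` and the (start, end) representation in seconds are assumptions made for illustration; the beta modification is not shown because it is not defined here.

```python
def giou_1d(pred, target):
    """Standard generalized IoU between two temporal segments.

    Each segment is a (start, end) pair with end > start.
    This is the plain GIoU, not the beta-GIoU variant named in the summary.
    """
    ps, pe = pred
    ts, te = target
    if pe <= ps or te <= ts:
        raise ValueError("segments must have positive length")

    # Overlap length (zero when the segments are disjoint).
    inter = max(0.0, min(pe, te) - max(ps, ts))
    # Total length covered by either segment.
    union = (pe - ps) + (te - ts) - inter
    # Length of the smallest segment enclosing both.
    hull = max(pe, te) - min(ps, ts)

    iou = inter / union
    # Penalize the empty gap inside the enclosing segment.
    return iou - (hull - union) / hull


def giou_loss_1d(pred, target):
    """Loss form: 0 for a perfect match, up to 2 for distant segments."""
    return 1.0 - giou_1d(pred, target)


if __name__ == "__main__":
    print(giou_1d((0.0, 2.0), (1.0, 3.0)))       # partial overlap
    print(giou_loss_1d((0.0, 1.0), (2.0, 3.0)))  # disjoint segments
```

Unlike plain IoU, the value stays informative for non-overlapping segments: the farther apart a predicted and a ground-truth segment lie, the more negative the score, which is why GIoU-style losses are commonly used for boundary regression.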

Abstract

Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims to detect both the type and the start and end frames of each action instance in a long, untrimmed video. Most current models use both RGB and optical-flow streams for the TAD task. The original RGB frames must therefore be converted manually into optical-flow frames at additional computation and time cost, which is an obstacle to real-time processing. Many current models also adopt two-stage strategies, which slow down inference and complicate the tuning of proposal generation. By comparison, we propose a one-stage, anchor-free temporal localization method that uses the RGB stream only and is built on a novel Newtonian Mechanics-MLP architecture. Its accuracy is comparable to that of existing state-of-the-art models, while its inference speed surpasses theirs by a large margin. The typical inference speed reported in this paper is 4.44 videos per second on THUMOS14. In practical applications the speed advantage is larger still, because no optical-flow conversion is needed. The results also show that MLPs have great potential in downstream tasks such as TAD. The source code is available at https://github.com/BonedDeng/TadML

Paper Structure (12 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of performance (average chart) and speed with the latest temporal action detection models on THUMOS14. Our method shows advanced performance and very fast speed while using only the RGB stream.
  • Figure 2: Three mainstream approaches: the traditional two-stream method, the two-stream one-stage method, and the RGB-only one-stage method.
  • Figure 3: Left: a block of the Mechanics-MLP architecture. Right: token mixing.
  • Figure 4: The architecture consists of three main parts: a backbone module for feature extraction and downsampling in time, a time fusion pyramid network (TFPN) serving as the neck, and action and time prediction branches operating as the head.