Boundary Discretization and Reliable Classification Network for Temporal Action Detection

Zhenying Fang; Jun Yu; Richang Hong

TL;DR

BDRC-Net tackles temporal action detection (TAD) by combining a boundary discretization module (BDM) with a reliable classification module (RCM) to address two limitations of mixed-method approaches: handcrafted anchor design and within-category false positives. BDM performs coarse-to-fine boundary localization via CCSM and RRSM, while RCM uses multiple instance learning (MIL) to predict reliable video-level action categories and suppress false positives. The approach achieves competitive performance on THUMOS'14, ActivityNet-1.3, and MultiTHUMOS, with consistent results across backbones and ablations that validate each component's contribution. The result is a practical mixed-method TAD framework that improves boundary accuracy and reduces false positives, with a public code repository provided for reproducibility.

Abstract

Temporal action detection aims to recognize the action category and determine each action instance's starting and ending time in untrimmed videos. Mixed methods have achieved remarkable performance by merging anchor-based and anchor-free approaches. Nonetheless, two crucial issues remain within the mixed framework: (1) Brute-force merging and handcrafted anchor design hinder the potential and practicality of mixed methods. (2) Within-category predictions contain a large number of false positives. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses these issues by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) merges anchor-based and anchor-free approaches in the form of boundary discretization, eliminating the need for traditional handcrafted anchor design. Furthermore, the reliable classification module (RCM) predicts reliable global action categories to reduce false positives. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves competitive detection performance. The code will be released at https://github.com/zhenyingfang/BDRC-Net.
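
To make the role of the RCM more concrete, the following is a minimal sketch of how video-level category predictions could be used to discard false positives. It is an illustration under stated assumptions, not the authors' implementation: the function names, the top-k mean pooling used as the MIL aggregation, the detection dictionary layout, and the 0.5 threshold are all hypothetical.

```python
import math


def video_level_probs(snippet_logits, k):
    """Aggregate per-snippet class logits into video-level probabilities.

    Assumption: top-k mean pooling over time followed by a sigmoid, a common
    MIL aggregation. The paper may use a different pooling scheme.
    """
    num_classes = len(snippet_logits[0])
    probs = []
    for c in range(num_classes):
        # Take the k highest logits for class c across all snippets.
        top = sorted((row[c] for row in snippet_logits), reverse=True)[:k]
        pooled = sum(top) / len(top)
        probs.append(1.0 / (1.0 + math.exp(-pooled)))
    return probs


def filter_detections(detections, video_probs, threshold=0.5):
    """Keep only detections whose category is supported at the video level."""
    return [d for d in detections if video_probs[d["label"]] >= threshold]


# Hypothetical example: 3 snippets, 2 categories.
snippet_logits = [[4.0, -3.0], [3.0, -4.0], [-1.0, -2.0]]
video_probs = video_level_probs(snippet_logits, k=2)

detections = [
    {"start": 1.0, "end": 4.5, "label": 0, "score": 0.9},
    {"start": 2.0, "end": 3.0, "label": 1, "score": 0.6},
]
kept = filter_detections(detections, video_probs)
```

In this toy input only category 0 receives strong video-level support, so the detection labelled with category 1 is dropped even though its snippet-level score is moderately high.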

Paper Structure

This paper contains 22 sections, 10 equations, 6 figures, 11 tables.

Introduction
Related work
Anchor-based TAD
Anchor-free TAD
Query-based TAD
Mixed TAD
Method
Problem Definition
Multi-Scale Backbone (MSB)
Boundary Discretization Module (BDM)
Reliable Classification Module (RCM)
Training and Inference
Training
Inference
Experiments
...and 7 more sections

Figures (6)

Figure 1: Comparison of boundary discretization, anchor-based, and anchor-free methods for representing action boundaries.
Figure 2: Example of non-discriminative snippets. It is impossible to accurately classify them with only partial snippets from two action categories, namely HighJump and LongJump. However, if the context of all snippets can be fully utilized, these actions can be easily distinguished.
Figure 3: Architecture Overview. Given an untrimmed video, the feature encoder extracts snippet features. Then, the multi-scale backbone (MSB) extracts multi-scale spatio-temporal features. Finally, the BDM in the detection head predicts action boundaries on each snippet, and the classification module predicts the action categories for each snippet. Specifically, our RCM predicts the video-level action categories based on snippet features, which are used to filter out false positives in the classification module. In the output module of CCSM, green and blue represent the regions of the start and end bins, respectively, with gray indicating invalid bins. The darker the green or blue, the higher the probability that the corresponding bin is a start or end bin. In the output module of RRSM, green and blue have the same meanings as in CCSM. Red arrows indicate the offset direction predicted by RRSM, with the offset distance prediction omitted for simplicity. In the output of the classification module, red indicates that the corresponding bin is predicted as the corresponding action category. In the output module of RCM, the longer the light purple bar, the higher the probability that the input video contains the corresponding action category. BDRC-Net obtains coarse prediction results from the probabilities predicted by CCSM and refines them using the offset direction and offset distance predicted by RRSM. The action category of each prediction result is determined jointly by the classification module and the RCM, with the RCM primarily used to filter out potential false positive predictions from the classification module.
Figure 4: BDM. For the $t$-th snippet, CCSM selects the center position of the bin with the highest predicted probability as a coarse prediction of the action boundary. Subsequently, RRSM regresses the distance between the coarse prediction and the actual boundary to obtain a refined boundary prediction.
Figure 5: Label assignment in BDM. For the starting bin on the $t$-th snippet, the distance $d_{s,t}$ between snippet $t$ and the starting boundary is computed first. Then, a Gaussian function with mean $d_{s,t}$ and variance $\sigma$ is used to assign labels in CCSM. Specifically, the label for each bin in CCSM is the value of the Gaussian function at the bin's center position. Labels are also assigned for RRSM, where the label for each bin is the distance between $d_{s,t}$ and the bin's center position.
...and 1 more figure
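
The label assignment and coarse-to-fine decoding described in the captions of Figures 4 and 5 can be sketched as follows. This is a hedged illustration rather than the released code: the uniform bin layout, the function names, the signed offset convention, and the use of `sigma` as a standard-deviation-like spread in an unnormalized Gaussian are all assumptions made for the example.

```python
import math


def bin_centers(num_bins, bin_width):
    """Centers of uniformly spaced distance bins (uniform layout is assumed)."""
    return [(i + 0.5) * bin_width for i in range(num_bins)]


def ccsm_labels(distance, centers, sigma):
    """Soft classification label per bin: Gaussian value at the bin center.

    Assumption: unnormalized Gaussian centered on the true distance, with
    sigma used as the spread parameter.
    """
    return [math.exp(-((c - distance) ** 2) / (2.0 * sigma ** 2)) for c in centers]


def rrsm_labels(distance, centers):
    """Regression label per bin: signed gap between true distance and center."""
    return [distance - c for c in centers]


def decode_distance(bin_probs, bin_offsets, centers):
    """Coarse-to-fine decoding: best bin center plus that bin's offset."""
    best = max(range(len(bin_probs)), key=bin_probs.__getitem__)
    return centers[best] + bin_offsets[best]


# Hypothetical example: snippet t = 10 lies 3.6 snippets after the action start.
centers = bin_centers(num_bins=4, bin_width=2.0)
cls_labels = ccsm_labels(3.6, centers, sigma=1.0)
reg_labels = rrsm_labels(3.6, centers)

# Feeding the labels back in as if they were perfect predictions recovers
# the true distance, and hence the start time t - distance.
refined = decode_distance(cls_labels, reg_labels, centers)
start_time = 10 - refined
```

Treating the labels as perfect predictions is only a sanity check of the decoding rule; in practice the two inputs to `decode_distance` would be network outputs, and the same procedure would be repeated for the ending boundary.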