Temporal Action Localization with Cross Layer Task Decoupling and Refinement

Qiang Li, Di Liu, Jun Kong, Sen Li, Hui Xu, Jianzhong Wang

TL;DR

This work tackles Temporal Action Localization (TAL) by addressing the conflicting feature needs of classification and localization. It introduces CLTDR-GMG, which pairs a GMG encoder that aggregates instant, local, and global temporal context (the global context via FFT-based features) with a CLTDR decoder that decouples classification and localization across pyramid layers and adds a refinement head. The approach reports state-of-the-art performance on five challenging benchmarks (THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, HACS), supported by ablations on cross-layer attention, multi-granularity encoding, and refinement. The method is efficient and generalizable, its code is publicly released, and it aims at balanced, accurate TAL across diverse video datasets.
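To make the three temporal granularities concrete, here is a minimal NumPy sketch of a gated block that mixes an instant (per-time-step) branch, a local (short temporal convolution) branch, and a global branch that filters the sequence in the frequency domain. It is an illustration under stated assumptions, not the GMG module from the paper: the function name `gated_multi_granularity`, the parameters `w_instant`, `k_local`, `f_global`, and `w_gate`, the kernel size of 3, and the sigmoid gate are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_multi_granularity(x, w_instant, k_local, f_global, w_gate):
    """Hypothetical gated mix of instant, local, and global temporal features.

    x:         (T, C) feature sequence
    w_instant: (C, C) per-time-step channel mixing (assumed)
    k_local:   (3, C) depthwise temporal kernel of size 3 (assumed)
    f_global:  (T // 2 + 1, C) per-frequency filter (assumed)
    w_gate:    (C, C) projection that produces the gate (assumed)
    """
    T, _ = x.shape
    # Instant branch: each time step is transformed independently.
    instant = x @ w_instant
    # Local branch: depthwise convolution over time with zero padding.
    padded = np.pad(x, ((1, 1), (0, 0)))
    local = sum(k_local[i] * padded[i:i + T] for i in range(3))
    # Global branch: multiply the spectrum by a filter, then transform back.
    # This is a circular convolution spanning the whole sequence.
    spectrum = np.fft.rfft(x, axis=0)
    global_ = np.fft.irfft(spectrum * f_global, n=T, axis=0)
    # Gate decides, per position and channel, how much of the mix passes.
    gate = sigmoid(x @ w_gate)
    return gate * (instant + local + global_)
```

With an all-ones `f_global` the global branch reproduces its input, which makes a convenient sanity check for the FFT round trip.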

Abstract

Temporal action localization (TAL) involves two tasks: classifying and localizing actions within untrimmed videos. However, the two tasks often have conflicting requirements for features. Existing methods typically employ separate heads for the classification and localization tasks but share the same input feature, leading to suboptimal performance. To address this issue, we propose a novel TAL method with Cross Layer Task Decoupling and Refinement (CLTDR). Based on the feature pyramid of a video, the CLTDR strategy integrates semantically strong features from higher pyramid layers and detailed boundary-aware features from lower pyramid layers to effectively disentangle the action classification and localization tasks. Moreover, the multiple cross-layer features are also employed to refine and align the disentangled classification and regression results. Finally, a lightweight Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and aggregate video features at instant, local, and global temporal granularities. Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art performance on five challenging benchmarks: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Our code and pre-trained models are publicly available at: https://github.com/LiQiang0307/CLTDR-GMG.
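The cross-layer idea can be sketched as follows: at pyramid layer l, the classification branch draws on the coarser layer above, and the localization branch draws on the finer layer below. This is a simplified illustration only. It assumes plain additive fusion after linear resampling, whereas the paper uses cross-layer attention and a refinement head; the helper names `resample_to` and `decouple_layer` are hypothetical.

```python
import numpy as np

def resample_to(x, length):
    """Linearly interpolate a (T, C) feature sequence to `length` time steps."""
    t_src = np.linspace(0.0, 1.0, x.shape[0])
    t_dst = np.linspace(0.0, 1.0, length)
    columns = [np.interp(t_dst, t_src, x[:, c]) for c in range(x.shape[1])]
    return np.stack(columns, axis=1)

def decouple_layer(pyramid, l):
    """Return (cls_feat, reg_feat) for pyramid layer l.

    pyramid: list of (T_l, C) arrays ordered from fine (index 0) to coarse.
    Additive fusion is an assumption made for brevity; it stands in for the
    cross-layer attention described in the paper.
    """
    p = pyramid[l]
    length = p.shape[0]
    # Fall back to the layer itself at the ends of the pyramid.
    coarser = pyramid[l + 1] if l + 1 < len(pyramid) else p
    finer = pyramid[l - 1] if l - 1 >= 0 else p
    cls_feat = p + resample_to(coarser, length)  # semantics from above
    reg_feat = p + resample_to(finer, length)    # boundary detail from below
    return cls_feat, reg_feat
```

The point of the sketch is that the two heads no longer receive the same input: each one sees the current layer enriched by a different neighbor.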

Paper Structure

This paper contains 22 sections, 13 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Comparison of different task decoupling strategies. (a) Previous methods use two head branches that share the same input feature; (b) our method uses cross-layer features for task decoupling and refinement. Zoom in for better view.
  • Figure 2: An illustration of our method. We build a feature pyramid with the GMG module. The CLTDR decoder at the $l$-th pyramid layer leverages the features $P^{l+1}$ and $P^{l-1}$ to generate distinct representations for the classification and localization tasks, followed by refinement using RefineHead.
  • Figure 3: Illustration of GMG. Zoom in for better view.
  • Figure 4: Illustration of Decoupled Classification Module and Decoupled Regression Module.
  • Figure 5: Visual comparison between CLTDR-GMG and two other methods before NMS. Each point denotes the classification and localization results (i.e., classification score and tIoU) of a frame in the video. The points with the best classification scores are marked as "$\star$".
  • ...and 3 more figures
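
The tIoU in the Figure 5 caption is the temporal intersection over union between a predicted segment and a ground-truth segment. A minimal sketch of the standard definition, assuming segments are given as (start, end) pairs in seconds; the function name `temporal_iou` is illustrative:

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, segments (2, 6) and (4, 8) overlap for 2 seconds out of a 6-second union, giving a tIoU of 1/3.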