Action Detection via an Image Diffusion Process
Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Jun Liu
TL;DR
Action detection in untrimmed videos is framed as a three-image generation problem, producing an action-class image and two temporal-boundary images, one for starting points and one for ending points. The authors propose ADI-Diff, which combines a Discrete Action-Detection Diffusion Process with a Row-Column Transformer to handle the discrete probability distributions and the distinct row and column dependencies in the action detection (AD) images. The approach achieves state-of-the-art mean average precision on THUMOS14 and ActivityNet-1.3 and demonstrates practical efficiency gains via image stitching and end-to-end diffusion-based inference. Overall, this work highlights how casting predictions as structured images and tailoring diffusion models to discrete, multi-task outputs can substantially improve video action detection performance.
Abstract
Action detection aims to localize the starting and ending points of action instances in untrimmed videos and to predict the classes of those instances. In this paper, we make the observation that the outputs of the action detection task can be formulated as images. Thus, from a novel perspective, we tackle action detection as a three-image generation process, in which our proposed Action Detection Image Diffusion (ADI-Diff) framework generates starting-point, ending-point, and action-class predictions as images. Since these images differ from natural images and exhibit special properties, we further explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle them. Our ADI-Diff framework achieves state-of-the-art results on two widely used datasets.
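
To make the observation that detection outputs can be formulated as images more concrete, the following is a minimal sketch of one possible encoding. The layout is an assumption made for illustration and is not taken from the paper: rows index action classes, columns index temporal positions, and each pixel holds a value in [0, 1]. The function name `segments_to_images` and the `(start, end, class)` tuple format are likewise hypothetical.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_step, end_step, class_id), end inclusive.
Segment = Tuple[int, int, int]
Image = List[List[float]]


def segments_to_images(
    segments: List[Segment], num_steps: int, num_classes: int
) -> Tuple[Image, Image, Image]:
    """Encode action instances as three images of shape (num_classes, num_steps).

    Assumed layout (illustrative only): rows are classes, columns are
    temporal positions, and a pixel value of 1.0 marks a positive label.
    """

    def blank() -> Image:
        return [[0.0] * num_steps for _ in range(num_classes)]

    class_img, start_img, end_img = blank(), blank(), blank()
    for start, end, cls in segments:
        if not (0 <= start <= end < num_steps and 0 <= cls < num_classes):
            raise ValueError(f"segment out of range: {(start, end, cls)}")
        # Action-class image: mark every position covered by the instance.
        for t in range(start, end + 1):
            class_img[cls][t] = 1.0
        # Boundary images: mark only the first and last positions.
        start_img[cls][start] = 1.0
        end_img[cls][end] = 1.0
    return class_img, start_img, end_img


# Two instances in a video with 8 temporal positions and 2 classes.
class_img, start_img, end_img = segments_to_images(
    [(1, 3, 0), (5, 6, 1)], num_steps=8, num_classes=2
)
```

Under this assumed layout, a generative model would be trained to produce the three images from video features, and detections would be read back from the generated pixels; the actual image construction used by ADI-Diff may differ.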
