Action Detection via an Image Diffusion Process

Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Jun Liu

TL;DR

Action detection in untrimmed videos is framed as a three-image generation problem, producing an action-class image and two temporal-boundary images for starts and ends. The authors propose ADI-Diff, combining a Discrete Action-Detection Diffusion Process with a Row-Column Transformer to handle discrete probability distributions and distinct row/column dependencies in the AD images. The approach achieves state-of-the-art mean average precision on THUMOS14 and ActivityNet-1.3 and demonstrates practical efficiency gains via image stitching and end-to-end diffusion-based inference. Overall, this work highlights how casting predictions as structured images and tailoring diffusion models to discrete, multi-task outputs can substantially improve video action detection performance.

Abstract

Action detection aims to localize the starting and ending points of action instances in untrimmed videos, and predict the classes of those instances. In this paper, we make the observation that the outputs of the action detection task can be formulated as images. Thus, from a novel perspective, we tackle action detection via a three-image generation process to generate starting point, ending point and action-class predictions as images via our proposed Action Detection Image Diffusion (ADI-Diff) framework. Furthermore, since our images differ from natural images and exhibit special properties, we further explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle their processing. Our ADI-Diff framework achieves state-of-the-art results on two widely-used datasets.

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of our formulated AD images, which allow us to tackle action detection by generating three images. The action-class AD image ($x^a$) has a shape of $N \times C$, while the starting and ending point AD images ($x^s$ and $x^e$) both have a shape of $N \times 2$, where we show $N = 5$ and $C = 5$ in this figure for illustration. Specifically, the pixel values in a row of the image form the probabilities of a discrete distribution regarding a specific video frame, e.g., the $n$-th row of the action-class AD image represents the probability distribution over the action classes for the $n$-th frame. We depict the ground truth AD images ($x^a_0,x^s_0,x^e_0$) in this figure, thus each row contains a single white pixel (with value 1) marking the correct prediction, while the other pixels are black (with value 0).
  • Figure 2: Illustration of the proposed AD Image Diffusion (ADI-Diff) framework. The forward process (represented with orange arrows) progressively diffuses the ground truth AD images $x_0^a,x_0^s,x_0^e$ towards a noisy outcome, which generates supervisory signals for intermediate steps. On the other hand, the reverse process (represented with green arrows) is trained to denoise the noisy inputs $x_T^a,x_T^s,x_T^e$ while conditioned on extracted spatio-temporal features $f_{ST}$ from the input video, to obtain the output AD images $\hat{x}_0^a,\hat{x}_0^s,\hat{x}_0^e$.
  • Figure 3: Visualization of diffusion process.
  • Figure 4: Comparison between the action-class AD image generated by our method (left) and standard diffusion (right).
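
The AD image layout described in the Figure 1 caption can be sketched in NumPy. This is only an illustrative sketch, not the authors' code: the function name `build_gt_ad_images`, the use of class index 0 as a background class, and the column order of the boundary images (column 0 = "not a boundary", column 1 = "boundary") are assumptions that the caption does not specify.

```python
import numpy as np


def build_gt_ad_images(num_frames, num_classes, instances, background=0):
    """Build one-hot ground-truth AD images for a single video.

    instances: list of (start_frame, end_frame, class_index), inclusive.
    background: class index assigned to frames outside any instance
                (an assumption made for this sketch).
    """
    frame_labels = np.full(num_frames, background, dtype=int)
    is_start = np.zeros(num_frames, dtype=int)
    is_end = np.zeros(num_frames, dtype=int)

    for start, end, cls in instances:
        frame_labels[start:end + 1] = cls  # frames covered by the action
        is_start[start] = 1                # frame where the action starts
        is_end[end] = 1                    # frame where the action ends

    # Indexing an identity matrix turns integer labels into one-hot rows,
    # so every row is a discrete distribution with a single value of 1.
    x_a = np.eye(num_classes)[frame_labels]  # action-class image, (N, C)
    x_s = np.eye(2)[is_start]                # starting-point image, (N, 2)
    x_e = np.eye(2)[is_end]                  # ending-point image, (N, 2)
    return x_a, x_s, x_e


# N = 5 frames and C = 5 classes as in Figure 1, with one hypothetical
# action of class 2 spanning frames 1 to 3.
x_a, x_s, x_e = build_gt_ad_images(5, 5, [(1, 3, 2)])
print(x_a.shape, x_s.shape, x_e.shape)
```

Each row of the three arrays sums to 1, matching the caption's description of a row as a probability distribution for one frame. Predicted AD images would hold soft probabilities in the same layout instead of one-hot rows.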