Taylor Videos for Action Recognition

Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng

TL;DR

This work defines an implicit motion-extraction function that extracts motions from a video temporal block, and shows that summing the higher-order terms of its Taylor series yields the dominant motion patterns, with static objects as well as small and unstable motions removed.

Abstract

Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) cover various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. To address these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, which we call Taylor frames. The Taylor video is named after the Taylor series, which approximates a function at a given point using its important terms. For videos, we define an implicit motion-extraction function that extracts motions from a video temporal block. Within this block, we use the frames, the difference frames, and the higher-order difference frames to perform a Taylor expansion that approximates this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series gives the dominant motion patterns, where static objects as well as small and unstable motions are removed. Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, they yield further accuracy improvements. Additionally, we apply Taylor video computation to human skeleton sequences, resulting in Taylor skeleton sequences that outperform the original skeletons for skeleton-based action recognition.
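
For reference, the textbook Taylor expansion that the name alludes to is written out here. The symbols $f$, $x_0$, and $K$ are generic placeholders and are assumptions of this note, not the paper's notation; the paper's own equations define how frame differences stand in for the derivatives.

```latex
% Truncated Taylor series of a K-times differentiable function f around x_0
f(x) \approx \sum_{k=0}^{K} \frac{f^{(k)}(x_0)}{k!}\,(x - x_0)^{k}
```

In the video setting described in the abstract, the expansion point is the starting frame of the temporal block, and the $k$-th order term is built from the $k$-th order difference frames.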

Paper Structure (16 sections, 8 equations, 11 figures, 6 tables, 1 algorithm)


Figures (11)

  • Figure 1: Visualizing different video formats. (Top left): RGB video and time-color reordering frames [kim2022capturing]. (Bottom left): ${\bm{U}}$ and ${\bm{V}}$ components of optical flow. (Right): proposed Taylor video frames. Taylor frames clearly (i) remove static objects and unstable motions and (ii) highlight motions.
  • Figure 2: Computing a single Taylor frame from a grayscale video temporal block ${\bm{\mathsfit{F}}} = [{\bm{F}}_1, {\bm{F}}_2, \cdots, {\bm{F}}_\tau, \cdots, {\bm{F}}_T]$, $\tau = 1, 2, \cdots, T$. We calculate the difference map between every two consecutive frames: $d({\bm{F}}_i) = {\bm{F}}_{i+1} - {\bm{F}}_i$, $i = 1, 2, \cdots, T$. We then calculate the higher-order differences in the temporal block, e.g., velocity maps using $v({\bm{F}}_i) = d({\bm{F}}_{i+1}) - d({\bm{F}}_i)$, acceleration maps using $a({\bm{F}}_i) = v({\bm{F}}_{i+1}) - v({\bm{F}}_i)$, jerk maps, etc. We compute the three channels of a Taylor frame by Eq. (\ref{eq:displace}), (\ref{eq:velocity}), and (\ref{eq:acc}), visualized in red, green, and blue, respectively.
  • Figure 3: Taylor frames indicate motion strengths and directions. (Top): Taylor frames. (Bottom): original RGB frames. All videos are from HMDB-51. Red, green, and blue represent displacement, velocity, and acceleration, respectively. Bolder colors indicate greater strength. We depict motion directions with white arrows: if green (velocity) is to the right of red (displacement), the object is moving rightwards in the next frame; if blue (acceleration) is to the left of red (displacement), the object is moving leftwards in the frame after.
  • Figure 4: Taylor videos remove redundancy, such as static backgrounds, unstable pixels, watermarks, and captions. This, together with their ability to highlight motion strengths and directions, is beneficial for action recognition. Videos are from HMDB-51.
  • Figure 5: Taylor frames capture subtle motions on MPII. The top four images show squeeze and the bottom four images show put in pan/pot. In each set, the motion region on the left is zoomed in on the right for better visualization. Best viewed in color.
  • ...and 6 more figures
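
The difference maps described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `difference_maps`, the `(T, H, W)` array layout, and the toy block are assumptions made here. It stops at the maps $d$, $v$, and $a$, because the channel equations referenced in the caption are not reproduced in this excerpt.

```python
import numpy as np


def difference_maps(block):
    """Return (d, v, a) for a grayscale temporal block of shape (T, H, W).

    Hypothetical helper following the Figure 2 caption:
      d(F_i) = F_{i+1} - F_i        (difference maps)
      v(F_i) = d(F_{i+1}) - d(F_i)  (velocity maps)
      a(F_i) = v(F_{i+1}) - v(F_i)  (acceleration maps)
    """
    block = np.asarray(block, dtype=np.float64)
    if block.ndim != 3 or block.shape[0] < 4:
        raise ValueError("expected shape (T, H, W) with T >= 4")
    d = np.diff(block, axis=0)  # T-1 maps
    v = np.diff(d, axis=0)      # T-2 maps
    a = np.diff(v, axis=0)      # T-3 maps
    return d, v, a


# Toy block (an assumption for illustration): a constant background
# and a single pixel whose intensity grows as t^2 over five frames.
T, H, W = 5, 4, 4
block = np.full((T, H, W), 100.0)
block[:, 1, 2] = np.arange(T) ** 2

d, v, a = difference_maps(block)
print(d[:, 1, 2])  # differences at the moving pixel
print(v[:, 1, 2])  # second-order differences
print(a[:, 1, 2])  # third-order differences
print(np.abs(d[:, 0, 0]).max())  # static background pixel
```

Static pixels produce zero in every map, which matches the caption's point that static content drops out once only difference-based terms are kept. Note that a block of T frames yields T-1 difference maps, T-2 velocity maps, and T-3 acceleration maps in this sketch.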