Revisiting Learning-based Video Motion Magnification for Real-time Processing

Hyunwoo Ha, Oh Hyun-Bin, Kim Jun-Seong, Kwon Byung-Ki, Kim Sung-Bin, Linh-Tam Tran, Ji-Yun Kim, Sung-Ho Bae, Tae-Hyun Oh

TL;DR

We introduce a deep learning-based motion magnification model that runs in real time on full-HD resolution videos, with 4.2X fewer FLOPs and 2.7X faster than the prior art while maintaining comparable quality.

Abstract

Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. The deep learning-based prior work successfully models the motion magnification problem with outstanding quality compared to conventional signal processing-based methods. However, it still falls short of real-time performance, which prevents it from being extended to various online applications. In this paper, we investigate an efficient deep learning-based motion magnification model that runs in real time for full-HD resolution videos. Because of the specific network design of the prior art, i.e., its inhomogeneous architecture, directly applying existing neural architecture search methods is complicated. Instead of automatic search, we carefully investigate the architecture module by module for its role and importance in the motion magnification task. Two key findings are: 1) reducing the spatial resolution of the latent motion representation in the decoder provides a good trade-off between computational efficiency and task quality, and 2) surprisingly, only a single linear layer and a single branch in the encoder are sufficient for the motion magnification task. Based on these findings, we introduce a real-time deep learning-based motion magnification model that has 4.2X fewer FLOPs and is 2.7X faster than the prior art while maintaining comparable quality.

Paper Structure

This paper contains 25 sections, 5 equations, 19 figures, and 1 table.

Figures (19)

  • Figure 1: Computational cost comparison between the architecture of Oh et al. [oh2018learning] and ours. Our model has $24.4\%$ fewer parameters and $4.2\times$ fewer FLOPs than the prior art [oh2018learning], and runs $2.7\times$ faster. Frames per second (FPS) are measured for Full-HD (FHD; $1920 \times 1080$) resolution videos. FLOPs are calculated for input frames of resolution $384 \times 384$.
  • Figure 2: Comparison of the linearity of motion magnification with respect to the amplification factor. Learning-based methods, including Oh et al. [oh2018learning] and ours, magnify the input motion linearly in the given amplification factor $\alpha$, whereas signal processing-based methods, including Wu et al. [wu2012eulerian] and Wadhwa et al. [wadhwa2013phase], show distorted and attenuated motion magnification. The peak-to-peak displacements are estimated by the Kanade-Lucas-Tomasi tracking algorithm [tomasi1991detection].
  • Figure 3: Overall architecture and specification of the baseline. The baseline, Oh et al. [oh2018learning], consists of three modules: the encoder (Enc.), the manipulator (Man.), and the decoder (Dec.). The encoders share weights. Given the two input frames, $I_t$ and $I_{t+1}$, the encoder takes each frame and outputs a shape representation, $S_i$, and a texture representation, $T_i$. The manipulator takes $S_t$ and $S_{t+1}$ and magnifies the motion by multiplying by the amplification factor $\alpha$. The decoder reconstructs the magnified frame $\tilde{I}$.
  • Figure 4: An example of the relationship between SSIM and visual similarity. To investigate this relationship, we train five networks that reach different SSIM values on synthetic data and observe their magnified frames for a real video sequence, i.e., crane. The five networks are obtained by varying the channel dimension of the entire baseline. Frames to the right of the red line are visually more similar to the input reference frame than those on the left, suggesting that a visually acceptable SSIM threshold can be established.
  • Figure 5: Relationship between SSIM and visual similarity scored by humans. Prior to the human study, the investigators are trained on examples of the motion magnification task to become acquainted with it. Then, we give the investigators the instruction "Please rate the visual similarity between the input image and the magnified frames on a scale of 0 to 5, where a score of 3 is considered sufficiently similar." The five networks have different SSIM values on the synthetic dataset, and each produces a magnified image for every input of the real data samples. Note that the five networks are obtained by varying the channel dimension of the entire baseline [oh2018learning] architecture. The red line denotes an SSIM value of 0.910, which corresponds to a score of 3.
  • ...and 14 more figures
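
The encoder, manipulator, and decoder flow described in the Figure 3 caption can be sketched in a few lines. This is only an illustrative toy, not the authors' network: the functions `encode`, `manipulate`, `decode`, and `magnify` are hypothetical names, the encoder and decoder are identity placeholders standing in for learned modules, frames are flat lists of floats, and the manipulator is assumed to be the purely linear rule $S_t + \alpha (S_{t+1} - S_t)$. Which texture representation is passed to the decoder is also an assumption here.

```python
from typing import List, Tuple

Vector = List[float]


def encode(frame: Vector) -> Tuple[Vector, Vector]:
    """Hypothetical stand-in for the weight-shared encoder.

    Returns (shape, texture). Both are plain copies of the frame here;
    a real encoder would be a learned network.
    """
    return list(frame), list(frame)


def manipulate(shape_t: Vector, shape_t1: Vector, alpha: float) -> Vector:
    """Assumed linear manipulator: scale the shape difference by alpha."""
    return [a + alpha * (b - a) for a, b in zip(shape_t, shape_t1)]


def decode(texture: Vector, shape: Vector) -> Vector:
    """Hypothetical stand-in for the decoder.

    Ignores the texture and returns the magnified shape unchanged.
    """
    return list(shape)


def magnify(frame_t: Vector, frame_t1: Vector, alpha: float) -> Vector:
    """Run the three-stage pipeline on a pair of frames."""
    shape_t, _ = encode(frame_t)
    shape_t1, texture_t1 = encode(frame_t1)  # texture choice is an assumption
    return decode(texture_t1, manipulate(shape_t, shape_t1, alpha))


# A change of +0.25 and -0.5 between frames is amplified four times.
print(magnify([0.0, 1.0, 2.0], [0.25, 1.0, 1.5], 4.0))  # → [1.0, 1.0, 0.0]
```

With $\alpha = 1$ this toy pipeline returns the second frame unchanged, and with $\alpha = 0$ it returns the first, which is the behavior one would expect from a manipulator that scales motion linearly.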