Pre-training for Action Recognition with Automatically Generated Fractal Datasets

Davyd Svyezhentsev, George Retsinas, Petros Maragos

TL;DR

Using fractal geometry, methods are presented to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models in the task of action recognition.

Abstract

In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks such as object classification and medical imaging. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain, applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on the established action recognition datasets HMDB51 and UCF101, as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of the downstream datasets. Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video .

Paper Structure

This paper contains 18 sections, 10 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the proposed approach. Aiming to pre-train neural models, we utilize fractal geometry and automatically construct large-scale datasets of short synthetic video clips (Sec. \ref{sub:fractal_animation}). We additionally narrow the domain gap between real and synthetic videos by identifying key properties of the former and emulating them during pre-training (Sec. \ref{sub:domain_adaptation}). The transferability of the proposed datasets and transformations is experimentally assessed by fine-tuning the pre-trained models on real action recognition benchmarks (Sec. \ref{sec:experiments}).
  • Figure 2: Examples of rendered 2D IFS attractors. A subset of linear samples exhibits unsatisfactory geometry, being either too sparse or too dense. Adding nonlinearity significantly alters the distribution of produced images and boosts overall diversity.
  • Figure 3: An example of the proposed animation method compared to naive interpolation (Sec. \ref{sub:fractal_animation}). The latter often results in undesired sparseness in the intermediate frames. The former mitigates this issue. More samples are displayed in Fig. \ref{fig:appendix_fractal_video_linear} and \ref{fig:appendix_fractal_video_nonlinear} of Appendix \ref{appendix:sup_visual_material}.
  • Figure 4: Examples of the proposed nonlinear interpolation curves. The objective is to approximate the complexity of real human motion.
  • Figure 5: Examples of the proposed domain adaptation methods. The purpose of these augmentations is to narrow the domain gap between real and synthetic videos.
  • ...and 6 more figures
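
The captions of Figures 2 and 3 refer to rendered 2D IFS attractors and to naive interpolation between them. As a rough illustration only, the sketch below renders a linear IFS attractor with the standard chaos game and builds a short clip by linearly interpolating the affine parameters of two IFS codes. It is not the authors' implementation (see the linked repository for that). The function names, coefficient ranges, grid size and point counts are all assumptions chosen for this example, and reading "naive interpolation" as per-parameter linear interpolation is also an assumption.

```python
import random


def random_ifs(num_maps, rng):
    """Draw hypothetical affine maps (a, b, c, d, e, f): (x, y) -> (ax+by+e, cx+dy+f).

    Linear coefficients lie in [-0.45, 0.45], so every row sum of absolute
    values is at most 0.9 and each map is a contraction in the max norm.
    """
    maps = []
    for _ in range(num_maps):
        a, b, c, d = (rng.uniform(-0.45, 0.45) for _ in range(4))
        e, f = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        maps.append((a, b, c, d, e, f))
    return maps


def chaos_game(maps, num_points, rng, burn_in=20):
    """Approximate the attractor by repeatedly applying a randomly chosen map."""
    x, y = 0.0, 0.0
    points = []
    for i in range(num_points + burn_in):
        a, b, c, d, e, f = rng.choice(maps)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= burn_in:  # discard the first iterates, which may lie off the attractor
            points.append((x, y))
    return points


def rasterize(points, size):
    """Map points to a size x size binary grid scaled to their bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    span_x = (max(xs) - x_min) or 1.0  # avoid division by zero for degenerate sets
    span_y = (max(ys) - y_min) or 1.0
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(size - 1, int((x - x_min) / span_x * size))
        row = min(size - 1, int((y - y_min) / span_y * size))
        grid[row][col] = 1
    return grid


def lerp_ifs(maps_a, maps_b, t):
    """Per-parameter linear interpolation between two IFS codes (assumed baseline)."""
    return [
        tuple((1 - t) * pa + t * pb for pa, pb in zip(ma, mb))
        for ma, mb in zip(maps_a, maps_b)
    ]


def render_clip(maps_a, maps_b, num_frames, size, num_points, seed):
    """Render one binary frame per interpolation step between two IFS codes."""
    rng = random.Random(seed)
    frames = []
    for k in range(num_frames):
        t = k / (num_frames - 1) if num_frames > 1 else 0.0
        points = chaos_game(lerp_ifs(maps_a, maps_b, t), num_points, rng)
        frames.append(rasterize(points, size))
    return frames
```

For instance, `render_clip(random_ifs(3, rng), random_ifs(3, rng), num_frames=8, size=64, num_points=20000, seed=0)` with `rng = random.Random(0)` returns eight 64×64 binary frames. A convex combination of two such contractive maps is itself contractive, so every intermediate frame has a bounded attractor. The sketch does nothing to control how sparse those intermediate frames are, which is the problem the Figure 3 caption says the proposed animation method mitigates.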