Deep video representation learning: a survey

Elham Ravanbakhsh, Yongqing Liang, J. Ramanujam, Xin Li

TL;DR

This survey analyzes deep video representation learning by separating features into dense (appearance-rich) and sparse (structure-focused) spatial categories and framing temporal modeling as frame-level or chunk-level. It assesses robustness under occlusion, illumination, view, and background changes, and discusses how extra modules (part information, additional modalities, and attention) can bolster performance. The work contrasts action recognition and video object segmentation to illustrate tradeoffs between co-occurrence learning and computational cost, highlighting multi-modal and attention-based approaches as promising directions. Overall, it provides a taxonomy, practical guidance, and a roadmap for future research in efficient, robust, multi-modal video feature learning.

Abstract

This paper reviews representation learning for videos. We classify recent spatiotemporal feature learning methods for sequential visual data and compare their pros and cons for general video analysis. Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding. Existing features can generally be categorized into spatial and temporal features. We discuss their effectiveness under variations in illumination, occlusion, view, and background. Finally, we discuss the remaining challenges in existing deep video representation learning studies.

Paper Structure (17 sections, 2 figures, 14 tables)

Figures (2)

  • Figure 1: Classification of deep video representation learning schemes.
  • Figure 2: Dense (RGB frames) and sparse (skeleton keypoints) features in action recognition. Dense features may include background information, while sparse features mainly encode the essential object structure. The usefulness of background varies: it can either distract from recognition (e.g., in the dancing images) or assist it (e.g., the activities in the right two columns). The RGB images in the upper left four columns are from Pexels, arranged here as a sequence of frames. The skeleton keypoints in the lower left four columns are from lv2022complexity. The images in the right two columns are from the UCF101 dataset soomro2012ucf101.