On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes

Rajat Modi, Vibhav Vineet, Yogesh Singh Rawat

TL;DR

The paper derives simple yet effective training recipes that yield occlusion-robust models which inductively satisfy the first two stages of the binding mechanism (grouping/segregation). Models that use these recipes outperform existing video action detectors under occlusion.

Abstract

This paper explores the impact of occlusions in video action detection. We facilitate this study by introducing five new benchmark datasets: O-UCF and O-JHMDB, which consist of synthetically controlled static/dynamic occlusions; OVIS-UCF and OVIS-JHMDB, which consist of occlusions with realistic motions; and Real-OUCF, which covers occlusions in real-world scenarios. We formally confirm an intuitive expectation: the performance of existing models degrades substantially as occlusion severity increases, and models behave differently when occluders are static than when they are moving. We discover several intriguing phenomena emerging in neural nets: 1) transformers can naturally outperform CNN models, even CNN models that used occlusion as a form of data augmentation during training; 2) incorporating symbolic components such as capsules into these backbones allows them to bind to occluders never seen during training; and 3) islands of agreement can emerge in realistic images/videos without instance-level supervision, distillation, or contrastive objectives (e.g. video-textual training). These emergent properties allow us to derive simple yet effective training recipes that lead to robust occlusion models inductively satisfying the first two stages of the binding mechanism (grouping/segregation). Models leveraging these recipes outperform existing video action detectors under occlusion by 32.3% on O-UCF, 32.7% on O-JHMDB, and 2.6% on Real-OUCF in terms of the vMAP metric. The code for this work has been released at https://github.com/rajatmodi62/OccludedActionBenchmark.

Paper Structure

This paper contains 26 sections, 1 equation, 15 figures, 14 tables.

Figures (15)

  • Figure 1: A Toy Experiment. (i-iii) superimposing a single occluder (bus) on an actor and varying its size results in drops as large as 50%. (iv) even a simple occluder (cat) in background results in 30% drop. (v) highest drops are observed if background is entirely masked. (vi-ix) Clean refers to all methods evaluated on unoccluded test sets. Best viewed when zoomed in.
  • Figure 2: Sample video frames from the proposed benchmark datasets. (i) Occlusion severity increases across 9 severity levels and over both actor and background regions. (ii) Occluders are sampled from both indoor and outdoor splits. (iii) Our O-UCF and O-JHMDB simulate 6 dynamic occluder motions such as circle, linear, zoom-in etc. (iv) Similarly, the proposed Real-OUCF is a dataset for realistic scenarios where multiple actors mutually occlude each other. Best viewed in color.
  • Figure 3: Effect of pretrained weights: VCAPS-Mvitv2 outperforms all models on both O-UCF and O-JHMDB. (X axis): Accuracy without occlusions. (Y axis): Relative robustness of the model. The top-right corner of each plot corresponds to the most robust model.
  • Figure 4: Effect of number of model parameters: (i) Increasing parameter size within the same model family yields more robustness. (X axis): Accuracy without occlusions. (Y axis): Relative robustness of the model. The top-right corner of each plot corresponds to the most robust model.
  • Figure 5: Emergent object/occluder separation in capsules. (top left) An occluded video is fed forward through VCAPS-Mvitv2 and the activations of the primary-layer capsules are visualized. Capsules can (i) parse an actor into constituent body parts without instance-level supervision [hinton1976using, hinton1977relaxation], (ii) segment multiple actors and objects of action (volleyball net), (iii) show evidence of focusing on occluders never seen during training, and (iv) undergo representational collapse, where a single capsule starts representing multiple objects (wholes) when the number of objects in the scene exceeds the number of capsules in the network. Best viewed in color.
  • ...and 10 more figures
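
The dynamic occluder motions named in the Figure 2 caption (circle, linear, zoom-in) can be illustrated with a small sketch that generates one occluder box per frame. This is not the benchmark's actual generation code: the function name `occluder_track`, the `size` parameter, the trajectory radius, the zoom factor, and the square-box convention are all assumptions made here for illustration.

```python
import math


def occluder_track(motion, num_frames, height, width, size=0.2):
    """Return one (top, left, side) square box per frame, in pixels.

    All parameters are illustrative assumptions, not the benchmark's
    real settings. `size` is the occluder side as a fraction of the
    shorter frame dimension.
    """
    base = size * min(height, width)
    boxes = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)  # progress through the clip, in [0, 1]
        cy, cx, side = height / 2, width / 2, base
        if motion == "static":
            pass  # occluder stays at the frame centre
        elif motion == "linear":
            # sweep horizontally from the left edge to the right edge
            cx = side / 2 + t * (width - side)
        elif motion == "circle":
            # one full revolution around the frame centre
            angle = 2 * math.pi * t
            cy += 0.25 * height * math.sin(angle)
            cx += 0.25 * width * math.cos(angle)
        elif motion == "zoom-in":
            # occluder grows to twice its initial side length
            side = base * (1 + t)
        else:
            raise ValueError(f"unknown motion: {motion}")
        side = int(round(side))
        # clamp so the box always lies fully inside the frame
        top = min(max(int(round(cy - side / 2)), 0), height - side)
        left = min(max(int(round(cx - side / 2)), 0), width - side)
        boxes.append((top, left, side))
    return boxes
```

Each returned box could then be used to paste an occluder crop into the corresponding video frame; a static occluder simply repeats the same box for every frame.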