SDI-Paste: Synthetic Dynamic Instance Copy-Paste for Video Instance Segmentation

Sahir Shrestha, Weihao Li, Gao Zhu, Nick Barnes

TL;DR

This paper introduces Synthetic Dynamic Instance Copy-Paste (SDI-Paste), a video data augmentation pipeline, and tests it on Video Instance Segmentation, a complex task that combines detection, segmentation, and tracking of object instances across a video sequence.

Abstract

Data augmentation methods such as Copy-Paste have been studied as effective ways to expand training datasets while incurring minimal costs. While such methods have been extensively implemented for image-level tasks, we found no scalable implementation of Copy-Paste built specifically for video tasks. In this paper, we leverage the recent growth in video fidelity of generative models to explore effective ways of incorporating synthetically generated objects into existing video datasets to artificially expand object instance pools. We first procure synthetic video sequences featuring objects that morph dynamically with time. Our carefully devised pipeline automatically segments and then copy-pastes these dynamic instances across the frames of any target background video sequence. We name our video data augmentation pipeline Synthetic Dynamic Instance Copy-Paste, and test it on the complex task of Video Instance Segmentation, which combines detection, segmentation, and tracking of object instances across a video sequence. Extensive experiments on the popular YouTube-VIS 2021 dataset, using two separate popular networks as baselines, show strong gains of +2.9 AP (6.5%) and +2.1 AP (4.9%). We make our code and models publicly available.

Paper Structure

This paper contains 11 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our proposed data-augmentation framework generates synthetic object instances that are temporally dynamic and copy-pastes them along a linear trajectory onto each frame of a video sequence ($F_{1}, F_{2}, \ldots, F_{N_f}$). Our aim is to increase the instance population of any existing video dataset.
  • Figure 2: Illustration of our SDI-Paste pipeline. Firstly, Synthetic Video Generation uses text prompts to obtain diverse video scenes. Secondly, frames in each scene are segmented to acquire synthetic dynamic object instances. Finally, using a linear-random trajectory scheme, these dynamic instances are copy-pasted onto existing video sequences to compose the augmented dataset.
  • Figure 3: Examples of dynamic video frames generated with AnimateDiff (Guo et al., 2023). We can observe single salient foreground objects undergoing seamless shape and viewpoint transitions as a result of their actions. Object features are mostly preserved, barring some aberrations such as extra ears or feet.
  • Figure 4: Linear instance placement trajectory. The direction $\theta$ is constant for all frames, but the displacement $\Delta$ varies frame by frame.
  • Figure 5: Examples of dynamic video frames obtained after instance composition. Dynamic instances are copy-pasted onto a background image alongside its existing objects to enlarge the instance pool for each sequence.
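
The placement scheme described in the Figure 4 and Figure 5 captions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `linear_trajectory` and `paste_instance`, the uniform sampling of $\theta$ and $\Delta$, the `max_step` parameter, and the hard (non-blended) mask paste are all assumptions made for illustration. The only properties taken from the captions are that one direction $\theta$ is shared by every frame, that the displacement $\Delta$ changes from frame to frame, and that the instance is pasted on top of the existing background frame.

```python
import numpy as np


def linear_trajectory(start_xy, num_frames, max_step, rng):
    """Hypothetical helper: per-frame (x, y) paste positions along one direction."""
    # Assumption: theta is drawn uniformly once and reused for every frame.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    # Assumption: a fresh displacement is drawn uniformly for each frame step.
    deltas = rng.uniform(0.0, max_step, num_frames - 1)
    direction = np.array([np.cos(theta), np.sin(theta)])
    # Cumulative distance travelled from the start position at each frame.
    offsets = np.concatenate([[0.0], np.cumsum(deltas)])
    return np.asarray(start_xy, dtype=float) + offsets[:, None] * direction


def paste_instance(frame, patch, mask, top_left_xy):
    """Hypothetical helper: copy masked patch pixels onto a copy of the frame."""
    out = frame.copy()
    h, w = mask.shape
    x0, y0 = int(round(top_left_xy[0])), int(round(top_left_xy[1]))
    # Intersect the patch rectangle with the frame rectangle.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + w, frame.shape[1]), min(y0 + h, frame.shape[0])
    if fx0 >= fx1 or fy0 >= fy1:
        return out  # the instance lies entirely outside this frame
    px0, py0 = fx0 - x0, fy0 - y0
    sub_mask = mask[py0:py0 + (fy1 - fy0), px0:px0 + (fx1 - fx0)]
    sub_patch = patch[py0:py0 + (fy1 - fy0), px0:px0 + (fx1 - fx0)]
    region = out[fy0:fy1, fx0:fx1]        # view into the output frame
    region[sub_mask] = sub_patch[sub_mask]  # overwrite only instance pixels
    return out
```

Under these assumptions, a clip would be augmented by calling `linear_trajectory` once per pasted instance and then `paste_instance` once per frame, pairing frame $F_i$ with the $i$-th segmented frame of the synthetic instance and the $i$-th trajectory position. The pasted mask at each position would also serve as the new instance's ground-truth annotation for that frame.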