
Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?

Samik Some, Vinay P. Namboodiri

Abstract

Present-day deep neural networks for video semantic segmentation require large numbers of fine-grained pixel-level annotations to achieve the best results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain, and coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches that reduce the annotation cost of video segmentation datasets by exploiting such resources. We show that, using the state-of-the-art segmentation foundation models Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames and coarse annotations to automate mask generation and thereby reduce the manual annotation effort for video segmentation datasets. Our investigation suggests that, used appropriately, these resources can cut the annotation requirement by a third while maintaining similar video semantic segmentation performance. More significantly, our analysis suggests that the variety of frames in the dataset matters more than their number for obtaining the best performance.
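
The propagation idea in the abstract is straightforward to prototype: a single manually annotated frame seeds per-object masks, and SAM 2's video predictor tracks them both forward and backward through the clip, turning one annotation into pseudo-labels for its neighbours. Below is a minimal sketch assuming the reference `sam2` package (facebookresearch/sam2); the checkpoint, config, frame directory, and label file are illustrative placeholders, and treating each semantic class in the annotation as one tracked object is our simplification, not necessarily the paper's exact procedure.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2_video_predictor

# Illustrative paths: substitute your own SAM 2 checkpoint and frame directory.
CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"
CHECKPOINT = "checkpoints/sam2.1_hiera_large.pt"
FRAME_DIR = "clip_0001"            # directory of JPEG frames: 00000.jpg, 00001.jpg, ...
ANNOTATED_IDX = 20                 # the single manually annotated frame

predictor = build_sam2_video_predictor(CONFIG, CHECKPOINT)

# Split the manual annotation into per-object binary masks. For semantic
# segmentation we simply treat each class present in the label as one object.
label = np.array(Image.open("clip_0001_frame20_label.png"))
object_masks = {int(c): label == c for c in np.unique(label) if c != 0}

with torch.inference_mode():
    state = predictor.init_state(video_path=FRAME_DIR)
    for obj_id, mask in object_masks.items():
        predictor.add_new_mask(state, frame_idx=ANNOTATED_IDX,
                               obj_id=obj_id, mask=torch.from_numpy(mask))

    pseudo_labels = {}
    # Forward pass covers future frames (e.g. the 30th frame in Figure 2) ...
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pseudo_labels[frame_idx] = (mask_logits > 0.0).cpu().numpy()
    # ... and the backward pass covers past frames (e.g. the 10th frame).
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state, reverse=True):
        pseudo_labels[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```

Note that propagation only tracks objects visible in the seeded frame; as Figure 2 in the list below illustrates, anything that first appears in another frame is never segmented, which bounds how far apart annotated frames can usefully be.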

Paper Structure

This paper contains 16 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Examples of fine-grained and coarse annotations in the Cityscapes dataset.
  • Figure 2: An example showing how well SAM 2 can predict masks for past and future frames, given a manually annotated frame. Here, the 20th frame is manually annotated, and masks for both the 10th and 30th frames were generated using SAM 2. We observe that SAM 2 can track and propagate the masks for objects present in the 20th frame very well. However, it cannot, by design, tell us anything about new objects introduced in frames 10 and 30.
  • Figure 3: A couple of examples from the Cityscapes dataset demonstrating how well SAM refines some of the segmentation classes. As can be seen, masks for objects like cars, traffic signs, poles and people have been refined quite nicely and can be treated as fine-grained rather than only rough approximations (a sketch of one plausible refinement step appears after this list).
  • Figure 4: Plot showing how mean Intersection over Union (mIoU) changes as the percentage of coarse annotations in the training data increases, for TMANet on the Cityscapes dataset (the mIoU metric itself is sketched after this list). The non-refined and refined curves correspond to the coarse annotations being used as-is versus being refined using SAM.
  • Figure 5: Plot showing how mean Intersection over Union (mIoU) changes as the percentage of coarse annotations in the training data increases, for TDNet on the Cityscapes dataset. The non-refined and refined curves correspond to the coarse annotations being used as-is versus being refined using SAM.
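
The figure list above references SAM-based refinement of coarse annotations (Figure 3) without spelling out the prompting strategy, so the following is a minimal sketch under one plausible assumption: each coarse instance mask is converted into a bounding-box prompt, and SAM's prediction replaces the coarse polygon. It uses the `segment_anything` package; the helper `refine_coarse_mask`, the checkpoint path, and the Cityscapes file name are illustrative.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Illustrative checkpoint path; any official SAM checkpoint works here.
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Load one Cityscapes frame (file name illustrative) and embed it once.
image = np.array(Image.open("frankfurt_000000_000294_leftImg8bit.png").convert("RGB"))
predictor.set_image(image)

def refine_coarse_mask(coarse_mask: np.ndarray) -> np.ndarray:
    """Refine one coarse binary instance mask by prompting SAM with the
    mask's bounding box, returning SAM's fine-grained prediction."""
    ys, xs = np.nonzero(coarse_mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]  # boolean mask at full image resolution
```

This box-prompt scheme suits compact objects like cars, traffic signs, poles and people, the classes Figure 3 highlights; amorphous regions such as road or sky would likely need point prompts or the coarse mask passed via `mask_input` instead.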
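
For completeness, the mean Intersection over Union (mIoU) plotted in Figures 4 and 5 is presumably the standard dataset-level metric used on Cityscapes: per-class intersections and unions are accumulated over all predictions before averaging. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def mean_iou(preds, gts, num_classes, ignore_index=255):
    """Dataset-level mIoU: accumulate per-class intersection and union
    over all (prediction, ground truth) pairs, then average class IoUs."""
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for pred, gt in zip(preds, gts):
        valid = gt != ignore_index           # Cityscapes marks void pixels as 255
        pred, gt = pred[valid], gt[valid]
        for c in range(num_classes):
            p, g = pred == c, gt == c
            inter[c] += np.logical_and(p, g).sum()
            union[c] += np.logical_or(p, g).sum()
    present = union > 0                      # skip classes absent from the data
    return float((inter[present] / union[present]).mean())
```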