Unsupervised Open-Vocabulary Object Localization in Videos

Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He

TL;DR

The proposed method first localizes objects in videos with slot attention and then assigns text to the resulting slots. The authors describe it as effectively the first unsupervised approach that yields good results on regular video benchmarks.

Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
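The text-assignment step can be illustrated with a minimal sketch, assuming each slot already has a semantic feature vector and each candidate label has a text feature vector in the same space. The function names `cosine` and `name_slots`, the toy 3-dimensional features, and the choice of a plain nearest-label rule by cosine similarity are all assumptions made for illustration; the abstract does not specify the exact matching rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def name_slots(slot_features, text_features, labels):
    """Assign each slot the label whose text feature is most similar to it."""
    names = []
    for slot in slot_features:
        scores = [cosine(slot, t) for t in text_features]
        best = max(range(len(labels)), key=lambda i: scores[i])
        names.append(labels[best])
    return names

# Toy features standing in for real embeddings (made-up numbers, hypothetical labels).
labels = ["bus", "car"]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
slot_features = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.1, 0.7, 0.2]]

slot_names = name_slots(slot_features, text_features, labels)
print(slot_names)
```

In a real pipeline the vectors would come from the vision and text encoders of a vision-language model rather than from hand-written lists.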

Paper Structure (32 sections, 7 equations, 14 figures, 4 tables)


Figures (14)

  • Figure 1: We propose an unsupervised approach to localize and name objects in real-world videos. The approach uses slot attention in feature space to localize tubes (second column), assigns text to the slot features via a CLIP model that was modified to allow local feature alignment (third column), and finally merges slots that overlap in text space (last column).
  • Figure 2: Proposed framework. Given an input video, we first localize objects by slot attention with a video encoder pretrained with self-supervision. Next, we extract semantic features for each slot by a patch-based CLIP finetuned from its vanilla version. Then, slots are named by matching slot semantic features to text features from a curated list of text prompts. Finally, the named slots are optimized to alleviate over-segmentation caused by part-whole hierarchies.
  • Figure 3: Patch-based CLIP finetuning. We replace the last ViT layer with a multi-head self-attention module to re-project semantic information to the new patch tokens. We then use cross-attention to encourage those patches that contain the main context to be similar to the $\mathit{CLS}_v$ token. Perhaps surprisingly, this suffices to get semantically meaningful patch features. Importantly, we do not use any labeled data during this fine-tuning step.
  • Figure 4: Joint optimization. The image on the left shows one frame of video slots with target names; the image on the right shows the result of merging. The two slots for the bus are merged, which localizes the object better, but they are not merged with the slot for the car because their semantics differ.
  • Figure 5: Three types of slots. We show examples of a slot containing a single object, part of an object or a group of objects. For simplicity we only colorize slots overlapping with objects, and slots in one image are colored differently.
  • ...and 9 more figures
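
The merging behavior described in the Figure 4 caption (two "bus" slots merged, the "car" slot kept separate) can be sketched as follows. This is a deliberately simplified illustration, assuming that slots are merged purely when they share the same assigned name; the paper's joint optimization may use additional criteria that the captions do not describe. The function `merge_slots_by_name` and the representation of a slot mask as a set of `(frame, row, col)` positions are hypothetical.

```python
def merge_slots_by_name(slot_masks, slot_names):
    """Union the masks of slots that share a name.

    slot_masks: list of sets of (frame, row, col) positions, one set per slot.
    slot_names: list of strings, one name per slot.
    Returns a dict mapping each name to the union of its slots' positions.
    """
    merged = {}
    for mask, name in zip(slot_masks, slot_names):
        merged.setdefault(name, set()).update(mask)
    return merged

# Toy example: two slots cover different parts of a bus, one covers a car.
slot_masks = [
    {(0, 0, 0), (0, 0, 1)},  # front part of the bus
    {(0, 0, 2), (0, 0, 3)},  # rear part of the bus
    {(0, 2, 0)},             # car
]
slot_names = ["bus", "bus", "car"]

merged = merge_slots_by_name(slot_masks, slot_names)
print(sorted(merged))
```

After merging, the "bus" entry covers all four bus positions, while the "car" entry is left unchanged.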