OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing

TL;DR

OW-VISCapTor addresses open-world video instance segmentation and captioning by introducing two abstractors that connect vision features to a frozen language model: an object abstractor that generates spatially rich open-world and closed-world object queries, and an object-to-text abstractor that uses masked cross-attention to produce fine-grained object-centric captions via an LLM. The approach is trained with a composite loss that combines an inter-query contrastive loss, which diversifies the queries, with detection/segmentation and captioning objectives, while the LLM remains frozen. Evaluations on BURST (OW-VIS) and VidSTG (Dense VOC) show substantial improvements over a generalized baseline and specialized state-of-the-art methods, notably a 13% gain on unseen categories and a 10% gain in captioning accuracy, demonstrating robust open-world discovery and captioning in online video processing. The work advances generalizable scene understanding by tightly coupling spatially rich object queries with language-driven description, enabling richer, object-centric interpretation of videos with unseen objects and scenes.

Abstract

We propose the new task 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects, and by 10% on object-centric captions.
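
The abstract does not state the form of the inter-query contrastive loss. As a rough illustration of the general idea (pushing object queries apart so they do not collapse onto the same object), the sketch below penalizes pairwise cosine similarity between query vectors. The function names `cosine` and `diversity_penalty` and the `margin` parameter are hypothetical, and this generic repulsion term is an assumption standing in for the paper's actual loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def diversity_penalty(queries, margin=0.0):
    """Mean hinge penalty on pairwise cosine similarity above `margin`.

    Hypothetical stand-in for an inter-query loss: the value is high when
    queries point in similar directions and zero when every pair has
    similarity at or below the margin.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, cosine(queries[i], queries[j]) - margin)
            pairs += 1
    return total / pairs
```

With two identical queries the penalty is close to 1, and with two orthogonal queries it is 0, so minimizing it favors queries that differ from one another.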

Paper Structure

This paper contains 30 sections, 4 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Our method, OW-VISCapTor, can simultaneously detect, segment, track, and caption objects in the given video frames. The first example (top row) shows a road scene with a previously unseen trailer truck and cars that are seen during training. The second example (bottom row) shows a person on a lawn mower and a dog on the grass. The lawn mower isn't part of the training set. We generate meaningful object-centric captions even for objects never seen during training. The captions for unseen objects are underlined.
  • Figure 2: Overview of OW-VISCapTor. An object abstractor connects the image feature space to the object query space, and an object-to-text abstractor connects the object query space to the text query space. DH and CH stand for detection head and classification head.
  • Figure 3: The proposed abstractors. (a) The object abstractor generates spatially rich open-world object queries $q_\mathrm{ow}$ from open-world embeddings $e_\mathrm{ow}$, and closed-world object queries $q_\mathrm{cw}$ from closed-world embeddings $e_\mathrm{cw}$. The open-world embeddings $e_\mathrm{ow}$ are generated by encoding a grid of points via a prompt encoder. The closed-world embeddings are learnt. (b) The object-to-text abstractor generates the object-centric text queries (e.g., $q^i_\mathrm{text}$ for the $i^\mathrm{th}$ object) that the frozen LLM uses for object-centric captioning. There are $L$ transformer blocks in the object-to-text abstractor, each one consisting of self-attention (SA), masked cross-attention (Masked CA), and a feed forward network (FFN).
  • Figure 4: Example from the BURST validation data. The masks are superimposed on the objects. The top row shows examples of parachutes in the air and people on the grass. The parachutes belong to the uncommon object category, i.e., parachutes were never seen during training. Our approach detects and retains the identities of the blue and the green parachutes as the green parachute crosses the blue one. The bottom row shows a person unboxing a leaf blower. The carton of the leaf blower (gray mask), the leaf blower (maroon mask), and the plastic wrapper (pink mask) are never seen during training. We can consistently detect, segment, and track them along with the person (common object category during training).
  • Figure 5: An example from the VidSTG data. Our approach is able to detect and track objects in the scene consistently and to generate meaningful object-centric captions for each of the detected objects.
  • ...and 6 more figures
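
The Figure 3 caption mentions masked cross-attention (Masked CA) inside the object-to-text abstractor. A minimal sketch of the masking step follows, assuming a single head, one query vector, and no learned projections. The function name `masked_cross_attention` and its arguments are hypothetical, and treating the mask as a per-position boolean list (for example, positions covered by an object's predicted mask) is an assumption, not a detail taken from the paper.

```python
import math

def masked_cross_attention(query, keys, values, mask):
    """Attend from one query to the key/value positions where mask is True.

    Masked positions get a score of -inf, so their softmax weight is 0
    and they contribute nothing to the output.
    """
    d = len(query)
    scores = []
    for key, keep in zip(keys, mask):
        if keep:
            scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        else:
            scores.append(float("-inf"))
    top = max(scores)
    if top == float("-inf"):
        raise ValueError("mask must keep at least one position")
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0.
    weights = [math.exp(s - top) for s in scores]
    normalizer = sum(weights)
    weights = [w / normalizer for w in weights]
    dim_v = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim_v)]

# Usage: the third position is masked out, so its large values are ignored.
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
out = masked_cross_attention([1.0, 0.0], keys, values, [True, True, False])
```

Here `out` is a weighted average of the first two value vectors only. Restricting attention in this way is what lets each text query summarize one object's region rather than the whole frame.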