Dense Video Object Captioning from Disjoint Supervision

Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

TL;DR

The paper proposes a unified model for dense video object captioning, i.e. detecting, tracking, and captioning the trajectories of objects in a video, and shows that this end-to-end approach is more accurate and temporally coherent than a multi-stage pipeline combining state-of-the-art detection, tracking, and captioning models.

Abstract

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language. We propose a unified model, and demonstrate how our end-to-end approach is more accurate and temporally coherent than a multi-stage pipeline combining state-of-the-art detection, tracking, and captioning models. Moreover, we propose a training strategy based on a mixture of disjoint tasks, which allows us to leverage diverse, large-scale datasets which supervise different parts of our model. Although each pretraining task only provides weak supervision, they are complementary and, when combined, result in noteworthy zero-shot ability and serve as strong initialization for additional finetuning to further improve accuracy. We carefully design new metrics capturing all components of our task, and show how we can repurpose existing video grounding datasets (e.g. VidSTG and VLN) for our new task. We show that our model improves upon a number of strong baselines for this new task. Furthermore, we can apply our model to the task of spatial grounding, outperforming prior state-of-the-art on VidSTG and VLN, without explicitly training for it. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/densevoc.

Paper Structure

This paper contains 29 sections, 10 equations, 4 figures, 13 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the dense video object captioning (Dense VOC) task. Given a video, we predict object trajectories (identities denoted by colors) and their natural language descriptions. We show a video from the VidSTG (Zhang et al., 2020) validation set.
  • Figure 2: Overview of Dense VOC. Our problem involves understanding across space, time, and language, and thus encompasses other vision tasks, which typically consider one or two of these axes. We show these subtasks are complementary, and pretraining on them enables zero-shot generalization to Dense VOC.
  • Figure 3: Overview of our model. Our end-to-end model has three modules. First, it produces per-frame object proposals using a class-agnostic detector (left, trained with detection loss $L_{object}$). These object proposals are then passed to an end-to-end tracking module that groups objects into trajectories (middle, trained with association loss $L_{assoc}$). The identities produced by the tracking module are used to aggregate features, which are then fed to a language decoder to produce the final caption (right, trained with caption loss $L_{caption}$). Our model can be trained end-to-end with partial supervision on different and disjoint datasets to provide zero-shot Dense VOC capabilities.
  • Figure 4: Qualitative results on VidSTG. Our model captures motion (1st row) and handles crowded scenes (2nd row). However, it may misrecognize objects (2nd row, "dog" should be "goat") and action boundaries (2nd row, "chasing" before it occurs).
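
The Figure 3 caption says the three losses can be trained with partial supervision from disjoint datasets. A minimal sketch of that idea follows: each batch contributes only the loss terms its source dataset can supervise. This is not the authors' implementation. The names `BatchLosses` and `total_loss`, the equal loss weights, and the example values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchLosses:
    """Loss values for one batch (hypothetical container).

    A field left as None marks a term that the batch's source dataset
    cannot supervise, e.g. no captions or no track identities.
    """
    l_object: Optional[float] = None   # detection loss, L_object
    l_assoc: Optional[float] = None    # association loss, L_assoc
    l_caption: Optional[float] = None  # caption loss, L_caption


def total_loss(losses: BatchLosses) -> float:
    """Sum only the supervised terms (equal weights are an assumption)."""
    terms = [losses.l_object, losses.l_assoc, losses.l_caption]
    return sum(t for t in terms if t is not None)


# Hypothetical batches drawn from datasets with different annotations.
boxes_only = BatchLosses(l_object=0.5)
boxes_and_tracks = BatchLosses(l_object=0.5, l_assoc=0.25)
fully_annotated = BatchLosses(l_object=0.5, l_assoc=0.25, l_caption=1.0)

for name, batch in [("boxes only", boxes_only),
                    ("boxes + tracks", boxes_and_tracks),
                    ("fully annotated", fully_annotated)]:
    print(name, total_loss(batch))
```

In this sketch, a batch that lacks an annotation type simply leaves the corresponding module without a gradient signal for that step, while modules that are supervised still receive one.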