Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning

Ping Li, Tao Wang, Xinkui Zhao, Xianghua Xu, Mingli Song

TL;DR

This work proposes a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module, and develops a transformer-based keyword refiner with a video-keyword gated fusion strategy that places more emphasis on relevant words.

Abstract

Video captioning generates a sentence that describes the video content. Existing methods typically require a number of captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike the random sampling used in natural language processing, which may cause invalid modifications (i.e., word edits), the pseudo-labeling module guides the model to edit words through a set of actions (e.g., copy, replace, insert, and delete) predicted by a pretrained token-level classifier, and then fine-tunes candidate sentences with a pretrained language model. The same module employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy that places more emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at https://github.com/mlvccn/PKG_VidCap
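The abstract does not spell out how repetition-penalized sampling is computed, so the following is only a minimal sketch of one common variant: logits of tokens that were already generated are scaled down before sampling. The function names, the penalty value of 1.2, and the divide-or-multiply rule are assumptions made for illustration and are not taken from the paper or its released code.

```python
import math
import random


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` in which already-generated tokens are less likely.

    Assumed rule (not from the paper): divide positive logits by `penalty`
    and multiply negative logits by `penalty`, so both move downward.
    """
    penalized = list(logits)
    for tok in set(generated_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, generated_ids, penalty=1.2, rng=None):
    """Sample one token id from the penalized distribution."""
    rng = rng or random.Random()
    probs = softmax(apply_repetition_penalty(logits, generated_ids, penalty))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a penalty of 1.0 the distribution is unchanged; larger values push the sampler away from tokens that already appear in the partial sentence, which is the behaviour the abstract describes as yielding pseudo-labeled sentences with less repetition.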

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Motivation illustration (3 vs 1 human). Video is from MSVD [chen-acl2011-msvd].
  • Figure 2: Overall framework of Pseudo-labeling with Keyword-refiner and Gated fusion (PKG) method for few-supervised video captioning.
  • Figure 3: Pseudo-label generator.
  • Figure 4: Performance under different $N_{pse}$. Please zoom in for best view.
  • Figure 5: Qualitative results of different pseudo-labeling strategies.
  • ...and 1 more figure