MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Sangho Lee, Il Yong Chun, Hogun Park

TL;DR

The paper tackles the problem of fixed-frame sampling in video captioning, which can miss critical information or introduce redundancy. It introduces MAMS, a model-agnostic framework that adaptively selects the size of the caption-generation module and constructs per-video subsets of visual tokens, complemented by an adaptive attention masking scheme. The approach yields consistent improvements across three benchmark datasets and three captioning models, with mPLUG-2 achieving a new state-of-the-art CIDEr score. The work offers a versatile strategy for adaptive input sizing in multi-modal transformers and may extend to other video understanding tasks.

Abstract

Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When too few frames are extracted, important frames carrying information essential for caption generation may be missed. Conversely, when too many frames are extracted, many of them are consecutive, which can make the visual tokens extracted from them redundant. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning, which has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
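
The two functions named in the abstract can be illustrated with a toy sketch. This is not the authors' method: the cosine-similarity redundancy rule, the threshold, the candidate module sizes, and all function names below are hypothetical, chosen only to show what "pick a module size, then build a token subset for it" might look like. The mask here is a plain padding mask, not the adaptive attention masking scheme the paper proposes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_tokens(frames, threshold=0.95):
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: a frame is redundant if its similarity to the most recently
    kept frame is at or above `threshold` (hypothetical rule).
    """
    kept = []
    for i, feat in enumerate(frames):
        if not kept or cosine(frames[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def select_module(num_kept, module_sizes=(2, 4, 8)):
    """Pick the smallest candidate size that fits, else the largest one.

    Assumption: each size stands for a caption module that accepts that
    many frames' worth of visual tokens.
    """
    for size in sorted(module_sizes):
        if num_kept <= size:
            return size
    return max(module_sizes)


def build_subset(frames, threshold=0.95, module_sizes=(2, 4, 8)):
    """Return (module size, kept frame indices, padding mask)."""
    kept = select_tokens(frames, threshold)
    size = select_module(len(kept), module_sizes)
    kept = kept[:size]  # truncate if even the largest module is too small
    mask = [1] * len(kept) + [0] * (size - len(kept))  # 1 = real, 0 = padding
    return size, kept, mask


if __name__ == "__main__":
    # Five toy frame features; frames 1 and 3 repeat their predecessors.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
    print(build_subset(frames))
```

In this toy setup, a video with many near-identical consecutive frames ends up routed to a smaller module, while a video with more distinct content is routed to a larger one, which is the intuition behind per-video input sizing.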

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The overview of the proposed framework.
  • Figure 2: BLEU-4 scores (Papineni et al., 2002) with different numbers of video frames in SwinBERT on the MSVD dataset. The dotted, dashed, and solid lines represent different experiments involving SwinBERT, with and without an attention mask, and the proposed MAMS framework.
  • Figure 3: The overall MAMS framework
  • Figure 4: Illustration of proposed module and token selector
  • Figure 5: Illustration of the proposed adaptive attention masking scheme
  • ...and 1 more figure