ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu; Yi Yang

TL;DR

ActBERT is a self-supervised framework for joint video-text representation learning that integrates global action cues, local regional object cues, and natural language descriptions. A TaNgled Transformer (TNT) block, called the ENtangled Transformer block (ENT) in the abstract, encodes the three information sources in parallel and uses cross-modal guidance from actions to strengthen language-vision and region-vision interactions. The model is pre-trained with four surrogate tasks on HowTo100M and transfers to downstream tasks such as text-video retrieval, captioning, VideoQA, action segmentation, and action step localization, where it achieves state-of-the-art results. The approach preserves fine-grained local cues while capturing global human intent, which is relevant to scalable, instruction-driven vision-language systems.

Abstract

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. ActBERT uncovers global and local visual clues from paired video sequences and text descriptions for detailed modeling of visual and text relations. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. This enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperforms the state of the art, demonstrating its superiority in video-text representation learning.
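
To make the idea of action-guided interaction between the three sources more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of the abstract: action features act as attention queries that pull visual clues from region features (to guide the language stream) and linguistic clues from word features (to guide the region stream). The function names, the single-head attention without learned projections, and the random toy features are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Single-head scaled dot-product attention (no learned projections).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def action_guided_cues(actions, regions, words):
    # Hypothetical helper: actions query the other two sources.
    # actions: (n_a, d), regions: (n_r, d), words: (n_w, d)
    cues_for_words = attend(actions, regions, regions)  # visual clues, (n_a, d)
    cues_for_regions = attend(actions, words, words)    # linguistic clues, (n_a, d)
    return cues_for_words, cues_for_regions

def guided_stream(tokens, cues):
    # Hypothetical helper: each token attends over its own stream plus the cues.
    context = np.concatenate([tokens, cues], axis=0)
    return attend(tokens, context, context)

# Toy features standing in for real action, region, and word embeddings.
rng = np.random.default_rng(0)
d = 8
actions = rng.normal(size=(2, d))
regions = rng.normal(size=(5, d))
words = rng.normal(size=(7, d))

cues_w, cues_r = action_guided_cues(actions, regions, words)
new_words = guided_stream(words, cues_w)      # language stream, guided by visual clues
new_regions = guided_stream(regions, cues_r)  # region stream, guided by linguistic clues
```

In this sketch each stream keeps its own sequence length, so the word and region outputs stay aligned with their inputs while still mixing in information selected by the global action features. A real model would add learned query/key/value projections, multiple heads, residual connections, and normalization.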
