COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox
TL;DR
The paper tackles cross-modal video-text representation learning by explicitly modeling hierarchy across granularities (frames/words, clips/sentences, videos/paragraphs). It introduces COOT, a Cooperative Hierarchical Transformer with a) attention-aware feature aggregation for intra-level fusion, b) a contextual transformer for inter-level interactions, and c) a cross-modal cycle-consistency loss to align video and text in a shared embedding space. Through extensive ablations and evaluations on ActivityNet-captions and YouCook2, COOT achieves state-of-the-art retrieval and captioning performance while using substantially fewer parameters than prior methods. The results underscore the value of hierarchical interactions and cycle-consistency in capturing long-range semantics for video-language tasks, with practical implications for scalable video search and indexing.
Abstract
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer, which learns the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss, which connects video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext
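
To make the third component more concrete, the following is a minimal NumPy sketch of one common way to formulate a cross-modal cycle-consistency loss. It is not taken from the authors' repository. The function names (`softmax`, `cycle_consistency_loss`), the use of squared Euclidean distance, and the soft nearest-neighbor formulation are assumptions made for illustration. Each sentence embedding is mapped to a soft nearest neighbor among the clip embeddings, that neighbor is mapped back to a soft position in the sentence sequence, and the loss penalizes the distance between the starting index and the position it cycles back to.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cycle_consistency_loss(sentences, clips):
    """Illustrative cycle-consistency loss (assumed formulation, not the authors' code).

    sentences: array of shape (n, d), one embedding per sentence.
    clips:     array of shape (m, d), one embedding per clip.
    Returns the mean squared gap between each sentence index and the
    soft index reached after the cycle sentence -> clips -> sentences.
    """
    n = sentences.shape[0]
    positions = np.arange(n, dtype=np.float64)
    total = 0.0
    for i in range(n):
        # Step 1: soft nearest neighbor of sentence i among the clips.
        dist_to_clips = np.sum((clips - sentences[i]) ** 2, axis=1)
        alpha = softmax(-dist_to_clips)
        soft_clip = alpha @ clips

        # Step 2: cycle back from the soft clip to a soft sentence position.
        dist_to_sentences = np.sum((sentences - soft_clip) ** 2, axis=1)
        beta = softmax(-dist_to_sentences)
        soft_index = beta @ positions

        # Step 3: penalize landing away from the starting index.
        total += (i - soft_index) ** 2
    return total / n


# Toy usage: well-separated, perfectly aligned embeddings cycle back to
# their own index, so the loss is close to zero.
aligned = 10.0 * np.eye(3)
loss_aligned = cycle_consistency_loss(aligned, aligned)

# Degenerate case: identical sentence embeddings cannot be told apart,
# so every cycle lands on the middle position.
collapsed = np.zeros((3, 2))
loss_collapsed = cycle_consistency_loss(collapsed, np.ones((4, 2)))
```

In this sketch the loss needs no clip-sentence alignment labels: it only asks that the cycle return to where it started, which is the property that makes a cycle-consistency objective usable on unaligned video-text pairs. A training implementation would express the same steps in an automatic-differentiation framework and batch them.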
