COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

TL;DR

The paper tackles cross-modal video-text representation learning by explicitly modeling hierarchy across granularities (frames/words, clips/sentences, videos/paragraphs). It introduces COOT, a Cooperative Hierarchical Transformer with a) attention-aware feature aggregation for intra-level fusion, b) a contextual transformer for inter-level interactions, and c) a cross-modal cycle-consistency loss to align video and text in a shared embedding space. Through extensive ablations and evaluations on ActivityNet-Captions and YouCook2, COOT achieves state-of-the-art retrieval and captioning performance while using substantially fewer parameters than prior methods. The results underscore the value of hierarchical interactions and cycle-consistency in capturing long-range semantics for video-language tasks, with practical implications for scalable video search and indexing.

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext

Paper Structure

This paper contains 45 sections, 10 equations, 5 figures, 19 tables.

Figures (5)

  • Figure 1: Overview of the COOT model (best viewed in color). The model consists of two branches: one for video input (top) and one for text input (bottom). Given a video and a corresponding text, we encode them to frame-level/word-level features. Features belonging to each segment (clip/sentence) are fed to a standard temporal transformer (T-Transformer) followed by the proposed feature aggregation module (Attention-FA) to obtain clip/sentence-level features. Finally, a new contextual transformer produces the final video/paragraph embedding based on interactions between local context (clip/sentence features) and global context (all frame/word features). The losses $\ell_{align}^L$, $\ell_{align}^H$, $\ell_{align}^g$ and $\ell_{CMC}$ force the model to align the representations at different levels.
  • Figure 2: Contextual Transformer (CoT). This module (right) encourages the model to optimize the representations with respect to interactions between local and global context. In the third sentence, to know the type of dough (cookie) the model should have information about the general context of the video (making chocolate cookies). Likewise, in the second sentence, to know that she is the "same woman", the model must be aware of the person's identity throughout the video.
  • Figure 3: Cross-Modality Cycle-Consistency. Starting from a sentence $s_i$, we find its nearest neighbor in the clip sequence and again its neighbor in the sentence sequence. Deviations from the start index are penalized as alignment error.
  • Figure 4: Noise vs. performance study on the ActivityNet-Captions dataset (val1).
  • Figure 5: Visualization of the video embedding space with t-SNE on ActivityNet-Captions. We apply t-SNE to reduce the video embedding space to 2 dimensions and visualize videos by one sample frame.
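
The cycle described in the Figure 3 caption (sentence, then nearest clip, then nearest sentence, then index deviation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cycle_index_error`, the squared Euclidean distance, and the hard `argmin` nearest neighbor are all choices made here for clarity. A hard `argmin` is not differentiable, so a trainable loss would need a soft nearest-neighbor variant instead.

```python
import numpy as np


def cycle_index_error(sent_emb: np.ndarray, clip_emb: np.ndarray) -> np.ndarray:
    """Hypothetical helper: squared index deviation after a sentence -> clip -> sentence cycle.

    sent_emb: (num_sentences, dim) sentence embeddings of one paragraph.
    clip_emb: (num_clips, dim) clip embeddings of the matching video.
    Returns an integer array of shape (num_sentences,).
    """
    # Pairwise squared Euclidean distances, shape (num_sentences, num_clips).
    # Assumption: the distance metric is chosen here for illustration only.
    diff = sent_emb[:, None, :] - clip_emb[None, :, :]
    dist = (diff ** 2).sum(axis=-1)

    # Step 1: for each sentence, the index of its nearest clip.
    nearest_clip = dist.argmin(axis=1)

    # Step 2: for each selected clip, the index of its nearest sentence.
    cycled_sentence = dist.T[nearest_clip].argmin(axis=1)

    # Step 3: penalize deviation from the start index.
    start = np.arange(sent_emb.shape[0])
    return (cycled_sentence - start) ** 2


if __name__ == "__main__":
    # Toy 2-D embeddings: the second sentence cycles back to the third one.
    sentences = np.array([[0.0, 0.0], [1.0, 0.0], [1.1, 0.0]])
    clips = np.array([[0.0, 0.0], [1.08, 0.0], [5.0, 0.0]])
    print(cycle_index_error(sentences, clips))
```

An error of zero for every sentence means each sentence returns to itself after the round trip through the clip sequence; nonzero entries mark sentences whose nearest clip is closer to a different sentence.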