Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

Peng Jin, Hao Li, Li Yuan, Shuicheng Yan, Jie Chen

TL;DR

This work reframes video–text representation learning as a multivariate cooperative game: video frames and text words are treated as players, and Hierarchical Banzhaf Interaction (HBI) captures fine-grained cross-modal coalitions among them. A representation reconstruction mechanism fuses single-modal granularity with cross-modal adaptability, mitigating bias in the Banzhaf Interaction (BI) calculation, and an encoder–decoder framework supports several downstream tasks (text–video retrieval, VideoQA, and captioning). The model applies BI at multiple levels (entity, action, event) with a KL-based alignment objective, deep supervision, and self-distillation to improve generalization. Experiments on retrieval, VideoQA, and captioning benchmarks report state-of-the-art performance and offer insight into the interpretability of hierarchical cross-modal interactions and the efficiency of the approach.

Abstract

Multimodal representation learning, particularly with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video and text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks show superior performance and validate the effectiveness and generalization of our method.

Paper Structure (17 sections, 16 equations, 11 figures, 8 tables, 1 algorithm)

Figures (11)

  • Figure 1: (a) Previous methods only learn a global semantic interaction from the coarse-grained labels of video-text pairs. (b) We model multimodal alignment as a cooperative game process, utilizing Banzhaf Interaction to evaluate possible correspondence between video frames and text words.
  • Figure 2: The intuition of employing Banzhaf Interaction in video-language representation learning. When certain players (frames and words) form a coalition, it entails the exclusion of these players from potential coalitions with others, rendering them mutually exclusive from the target coalition. Banzhaf Interaction quantifies the disparity between the benefits derived from the coalition and the costs incurred due to the lost coalitions. Therefore, Banzhaf Interaction effectively captures the incremental benefits conferred by the coalition. We refer the reader to Eq. \ref{BI} for the detailed formula.
  • Figure 3: Performance comparisons on text-video retrieval, video-question answering, and video captioning. Our proposed framework, HBI V2, designed for general video-language representation learning, demonstrates superior performance consistently. Notably, HBI V2 not only surpasses the previous HBI, but also outperforms existing task-specific methods.
  • Figure 4: Overview of our proposed HBI V2 framework. We employ a dual-stream encoder to extract features for video tokens and text tokens. Subsequently, we reconstruct the original representation by merging single-modal and cross-modal components. We propose a novel proxy training objective, which uses Banzhaf Interaction to evaluate possible correspondence between video tokens and text tokens from various levels. Furthermore, we customize different task-specific prediction heads for various downstream tasks.
  • Figure 5: The representation reconstruction module. To address the bias in calculations within Banzhaf Interaction, we reconstruct both video and text representation as a fusion of single-modal and cross-modal components. The representation reconstruction module maintains the granularity inherent in single-modal representations while preserving the adaptive encoding capabilities of cross-modal representations.
  • ...and 6 more figures
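
The intuition in the Figure 2 caption (benefit of a coalition minus the cost of the coalitions it displaces) can be illustrated with the standard Banzhaf interaction index from cooperative game theory: for a pair of players i and j, average v(C ∪ {i, j}) − v(C ∪ {i}) − v(C ∪ {j}) + v(C) over every coalition C of the remaining players. The sketch below is a brute-force version on a toy game, not the paper's implementation. The player names and the `toy_value` payoff are hypothetical, chosen only so that one frame-word pair has obvious synergy.

```python
from itertools import combinations


def banzhaf_interaction(players, i, j, value):
    """Standard Banzhaf interaction index of the pair (i, j).

    `value` maps a frozenset of players to a numeric payoff.
    Brute force: enumerates all 2^(n-2) coalitions of the other players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            c = frozenset(subset)
            # Gain from i and j acting together, minus what each adds alone.
            total += (value(c | {i, j}) - value(c | {i})
                      - value(c | {j}) + value(c))
    return total / 2 ** len(others)


# Hypothetical toy game: two frames and two words as players.
PLAYERS = ["frame_dog", "frame_sky", "word_dog", "word_runs"]


def toy_value(coalition):
    """Assumed payoff: 1 per member, plus a bonus of 10 when the
    matching frame and word ("frame_dog", "word_dog") are both present."""
    bonus = 10 if {"frame_dog", "word_dog"} <= coalition else 0
    return len(coalition) + bonus


matched = banzhaf_interaction(PLAYERS, "frame_dog", "word_dog", toy_value)
unmatched = banzhaf_interaction(PLAYERS, "frame_sky", "word_dog", toy_value)
print(matched, unmatched)
```

In this toy game the additive part of the payoff cancels in every term, so only the matched frame-word pair receives a non-zero interaction. Exhaustive enumeration grows exponentially with the number of players, so it is practical only for small illustrative games like this one.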

Theorems & Definitions (1)

  • Definition 1