GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning

Yicheng Wang, Zhikang Zhang, Jue Wang, David Fan, Zhenlin Xu, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li

TL;DR

GEXIA addresses cross-modal video-language alignment for multi-grained data with two coordinated components. Granularity EXpansion (GEX) builds a multi-grained dataset from a single-grained source through Integration and Compression operations. The Iterative Approximation Module (IAM) embeds variable-length dense features into a fixed, low-dimensional space, which keeps cross-modal alignment with a VTC-style loss scalable. Together they handle long-form videos and texts without extensive new data collection, and the model reaches state-of-the-art or competitive results on seven benchmarks, including strong zero-shot performance on long-form tasks. The main contributions are a scalable data-generation pipeline and a general-purpose IAM that adapts to input granularity through its iteration count while preserving semantic information in compact embeddings. The work also covers practical aspects such as computational efficiency and LLM-based text compression, and notes that the approach could extend to further granularities and benchmarks.
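To make the idea of a fixed-size embedding controlled by an iteration count concrete, here is a minimal sketch. It is not the paper's IAM: it assumes a Perceiver-style design in which a single latent vector repeatedly attends over the dense features, and it omits all learned projections. The names `cross_attend` and `iterative_approximation` are hypothetical, and the features are assumed to share the latent's dimension.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latent, features):
    """One attention read: the latent queries every dense feature vector."""
    scale = math.sqrt(len(latent))
    scores = [sum(q * k for q, k in zip(latent, f)) / scale for f in features]
    weights = softmax(scores)
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(len(latent))]


def iterative_approximation(features, latent_init, num_iter):
    """Refine a fixed-size latent by reading the features num_iter times.

    Assumption: residual update followed by L2 normalisation, with no
    learned weights. A real module would use trained projections.
    """
    latent = list(latent_init)
    for _ in range(num_iter):
        read = cross_attend(latent, features)
        latent = [l + r for l, r in zip(latent, read)]
        norm = math.sqrt(sum(x * x for x in latent)) or 1.0
        latent = [x / norm for x in latent]
    return latent


rng = random.Random(0)
dim = 8
latent_init = [rng.uniform(-1, 1) for _ in range(dim)]
short_clip = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
long_video = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(64)]

# Fewer iterations for the short input, more for the long one.
short_emb = iterative_approximation(short_clip, latent_init, num_iter=1)
long_emb = iterative_approximation(long_video, latent_init, num_iter=3)
print(len(short_emb), len(long_emb))  # both equal dim, whatever the input length
```

The point of the sketch is only the shape contract: 4 tokens or 64 tokens go in, and a vector of the same size comes out, with `num_iter` as the one knob that changes with granularity.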

Abstract

In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance. Remarkably, our model excels in tasks involving long-form video understanding, even though the pretraining dataset only contains short video clips.
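The two GEX operations can be sketched in a few lines. The sketch assumes that Integration concatenates consecutive short clips and their captions, and that Compression shortens the concatenated text with a summarizer; the paper uses an LLM for that step, which is replaced here by a trivial stand-in. The grouping strategy and the names `integrate`, `compress`, and `first_words` are illustrative assumptions, not the authors' pipeline.

```python
from typing import Callable, List, Tuple

Clip = str                     # hypothetical: a clip is identified by its file name
Pair = Tuple[List[Clip], str]  # (ordered clips forming one video, paired text)


def integrate(pairs: List[Tuple[Clip, str]], group_size: int) -> List[Pair]:
    """Integration: concatenate consecutive clips and their captions."""
    long_pairs = []
    for start in range(0, len(pairs) - group_size + 1, group_size):
        group = pairs[start:start + group_size]
        clips = [clip for clip, _ in group]
        text = " ".join(caption for _, caption in group)
        long_pairs.append((clips, text))
    return long_pairs


def compress(long_pairs: List[Pair], summarize: Callable[[str], str]) -> List[Pair]:
    """Compression: pair each long video with a shortened version of its text."""
    return [(clips, summarize(text)) for clips, text in long_pairs]


def first_words(text: str, limit: int = 3) -> str:
    """Stand-in for the LLM summarizer: keep only the first few words."""
    return " ".join(text.split()[:limit])


short_pairs = [
    ("a.mp4", "a dog runs"),
    ("b.mp4", "a cat sleeps"),
    ("c.mp4", "a bird sings"),
    ("d.mp4", "rain falls"),
]
long_pairs = integrate(short_pairs, group_size=2)   # long video, long text
compressed = compress(long_pairs, first_words)      # long video, short text
print(long_pairs[0])
print(compressed[0])
```

Starting from one granularity (short video, short text), the sketch yields two more: long video with long text, and long video with short text. Changing `group_size` gives further granularities without collecting any new data.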

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: An overview of the Granularity EXpansion (GEX) pipeline, which expands a single-grained dataset into a multi-grained dataset with video and text integration $\oplus_v$ and $\oplus_t$ and text compression $\Theta_t$ operations.
  • Figure 2: An overview of GEXIA, which consists of Granularity EXpansion (GEX), Dense Feature Extraction, Iterative Approximation Module (IAM), and Cross-modal Alignment. We propose GEX and IAM to address data and modeling challenges, respectively.
  • Figure 3: Zero-shot T2V retrieval results with different $\#iter$ for long video/text data on the ActivityNet Captions dataset.
  • Figure 4: t-SNE visualization of the CLIP-based features for the sampled 100 concatenated long videos, concatenated long texts, and summarized short texts.
  • Figure 5: Average model inference run time on ActivityNet Captions across different $\#iter$ setups for GEXIA, compared to CLIP4Clip (mean pooling setup).
  • ...and 2 more figures