READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling

Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Khoi Le, Zhiyuan Hu, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

TL;DR

The paper proposes a novel REcurrent ADapter (READ), which uses recurrent computation to enable temporal modeling, and a Partial Video-Language Alignment (PVLA) objective, which uses partial optimal transport to keep task-related information flowing into the READ modules.

Abstract

Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such a full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pretrained model and update only them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose a Partial Video-Language Alignment (PVLA) objective that uses partial optimal transport to maintain the task-related information flowing into our READ modules. We validate our READ framework through extensive experiments, in which READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. The code, model, and data have been made available at https://nguyentthong.github.io/READ.
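
To make the idea of recurrent computation inside an adapter's low-dimensional space more concrete, the following is a minimal sketch in plain Python. It is not the authors' READ layer: the abstract does not specify the architecture, so the Elman-style recurrence, the residual connection, the zero-initialised up-projection, and every name here (`RecurrentAdapterSketch`, `W_down`, `W_rec`, `W_up`, `d_low`) are assumptions chosen for illustration only.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the initialisation scale is an arbitrary choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RecurrentAdapterSketch:
    """Hypothetical bottleneck adapter with an Elman-style recurrence.

    Assumption: this only illustrates "recurrent computation in a
    low-dimensional adapter space"; it is not the READ architecture.
    """

    def __init__(self, d_model, d_low, seed=0):
        rng = random.Random(seed)
        self.d_low = d_low
        self.W_down = rand_matrix(d_low, d_model, rng)  # project to low-dim space
        self.W_rec = rand_matrix(d_low, d_low, rng)     # hidden-to-hidden weights
        # Zero up-projection: the adapter starts out as an identity mapping.
        self.W_up = [[0.0] * d_low for _ in range(d_model)]

    def forward(self, sequence):
        """Process a sequence of d_model-sized vectors (frames or tokens) in order."""
        h = [0.0] * self.d_low  # recurrent state carried across time steps
        outputs = []
        for x in sequence:
            z = matvec(self.W_down, x)
            r = matvec(self.W_rec, h)
            h = [math.tanh(a + b) for a, b in zip(z, r)]
            delta = matvec(self.W_up, h)
            outputs.append([xi + di for xi, di in zip(x, delta)])  # residual add
        return outputs

    def num_trainable(self):
        """Count adapter weights; the frozen backbone is not represented here."""
        mats = (self.W_down, self.W_rec, self.W_up)
        return sum(len(row) for W in mats for row in W)


adapter = RecurrentAdapterSketch(d_model=8, d_low=2)
frames = [[1.0] * 8, [0.5] * 8, [0.0] * 8]
adapted = adapter.forward(frames)
```

In this sketch the hidden state `h` is what lets the output at one time step depend on earlier frames or words, which a purely position-wise bottleneck adapter cannot do. Only the three small matrices would be updated during fine-tuning, while the pretrained backbone stays frozen.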

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Examples of the temporal language grounding (TLG) and video-language summarization (VLS) problems. A TLG model needs to understand the meaning of language entities such as proposal or girl, and the existence of expression in video frames. A VLS model is expected to recognize salient information, e.g., crank bolt and bottom bracket from the language, and bicycle from the video.
  • Figure 2: Comparison of our proposed READ method with full fine-tuning and other parameter-efficient fine-tuning methods. For each method, we report the mAP gain averaged over the domains of the YouTube Highlights dataset, together with the number of trainable parameters.
  • Figure 3: Overall illustration of the proposed recurrent adapter (READ) and partial video-language alignment (PVLA) framework.