DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning

Zeyi Bo, Wuxi Sun, Ye Jin

TL;DR

This work tackles the overhead of fine-tuning large Video Transformer backbones for multiple tasks by enabling cross-task knowledge transfer through prompts. It introduces DLM-VMTL, a Double-Layer Mapper that learns prompts from auxiliary tasks' intermediate representations via a first-layer self-attention module and aligns them to the primary task with a second-layer adapter, while keeping backbones frozen. The approach is validated on 11 datasets across 6 video tasks, achieving superior performance with only $10.8\%$ of the total backbone parameters and demonstrating improved zero-shot transfer. Overall, DLM-VMTL provides a scalable, plug-in solution for heterogeneous data video multi-task prompt learning with strong empirical gains.
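The two-stage mapping described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name `DoubleLayerMapperSketch`, the single-head attention, the bottleneck adapter with a residual connection, the random weights, and the assumption that auxiliary and primary tokens share one feature width are all hypothetical choices made for illustration. The frozen backbones are represented only by the token matrices they would produce.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DoubleLayerMapperSketch:
    """Hypothetical two-layer mapper: self-attention, then an adapter."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Layer 1 weights: single-head self-attention (an assumption).
        self.w_q = rand_matrix(dim, dim, rng)
        self.w_k = rand_matrix(dim, dim, rng)
        self.w_v = rand_matrix(dim, dim, rng)
        # Layer 2 weights: bottleneck adapter (an assumption).
        self.w_down = rand_matrix(dim, bottleneck, rng)
        self.w_up = rand_matrix(bottleneck, dim, rng)

    def self_attention(self, tokens):
        """Layer 1: turn auxiliary-task tokens into prompt tokens."""
        q = matmul(tokens, self.w_q)
        k = matmul(tokens, self.w_k)
        v = matmul(tokens, self.w_v)
        scores = matmul(q, transpose(k))
        scaled = [[s / math.sqrt(self.dim) for s in row] for row in scores]
        return matmul(softmax_rows(scaled), v)

    def adapter(self, prompts):
        """Layer 2: down-project, ReLU, up-project, add residual."""
        hidden = [[max(0.0, h) for h in row] for row in matmul(prompts, self.w_down)]
        delta = matmul(hidden, self.w_up)
        return [[p + d for p, d in zip(p_row, d_row)]
                for p_row, d_row in zip(prompts, delta)]

    def __call__(self, aux_tokens):
        return self.adapter(self.self_attention(aux_tokens))


def prepend_prompts(prompts, primary_tokens):
    """Prompt the primary task by prepending the mapped tokens."""
    return prompts + primary_tokens


rng = random.Random(1)
aux_tokens = rand_matrix(4, 8, rng, scale=1.0)      # stand-in for frozen auxiliary features
primary_tokens = rand_matrix(6, 8, rng, scale=1.0)  # stand-in for frozen primary features
mapper = DoubleLayerMapperSketch(dim=8, bottleneck=2)
prompts = mapper(aux_tokens)
prompted = prepend_prompts(prompts, primary_tokens)
```

In this sketch only the mapper's five weight matrices would be trainable, which mirrors the idea of keeping the backbones frozen. With several auxiliary tasks, one such mapper per task could contribute its own prompts to the prepended sequence.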

Abstract

In recent years, the parameter counts of backbones for video understanding tasks have continued to grow, in some cases reaching the billion level. Both fine-tuning a video foundation model for a specific task and pre-training a model designed for that task incur substantial overhead. How to make these models useful beyond their own tasks is therefore a question worth studying. Multi-Task Learning (MTL) lets a visual task acquire rich shareable knowledge from other tasks during joint training. It has been thoroughly explored for image recognition tasks, especially dense prediction tasks. Nevertheless, it is rarely used in the video domain because multi-label video data is scarce. In this paper, a heterogeneous data video multi-task prompt learning (VMTL) method is proposed to address this problem. Unlike MTL in the image domain, it uses a Double-Layer Mapper (DLM) to extract the shareable knowledge into visual prompts and align them with the representation of the primary task. Extensive experiments show that DLM-VMTL outperforms the baselines on 6 different video understanding tasks and 11 datasets.

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The pipelines of conventional Multi-Task Learning (left) and the proposed new paradigm for Heterogeneous Data Video Multi-Task Prompt Learning (right).
  • Figure 2: (a) An overview of our DLM-VMTL. Note that we describe the case where one auxiliary task prompts the primary task. The case of multiple auxiliary tasks adds more prompts at the prompt tuning stage. (b) The detailed structure of DLM.
  • Figure 3: Comparison between single-layer and double-layer structures. Dashed lines denote the single-layer structure.