Multi-layer Learnable Attention Mask for Multimodal Tasks

Wayner Barrios, SouYoung Jin

TL;DR

This work introduces the Learnable Attention Mask (LAM), a plug-and-play module that globally regulates attention maps in Transformer encoders to better handle long, multimodal sequences. By outputting a token-level mask and enabling per-layer masks, LAM prioritizes salient tokens while reducing redundant computations, with a multi-layer extension capturing information across Transformer stages. Empirical results across MADv2, QVHighlights, ImageNet-1K, and MSRVTT show substantial gains on multimodal tasks and more modest gains on single-modality tasks, while analyses of attention distributions corroborate the efficiency and interpretability benefits. The approach offers a practical, adaptable mechanism for improving multimodal understanding in existing transformer-based architectures, with detailed ablations and qualitative analyses supporting its effectiveness and design choices.

Abstract

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.

Paper Structure

This paper contains 37 sections, 12 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: (a) While video and audio tokens naturally align in time, their associations can extend beyond temporal boundaries. For example, "Joanna’s shouts" may correspond to multiple video tokens (i.e., not just v8-11, but also v13-16). (b) The Self-Attention module of the Transformer captures these attention scores only locally, token versus token. We introduce the Learnable Attention Mask (LAM), which takes a holistic view of the entire sequence of input tokens and generates a mask that captures attention structures globally.
  • Figure 2: Overview of the Learnable Attention Mask Architecture. The Learnable Attention Mask (LAM) module takes the entire sequence as input and generates a mask. This mask is then used for element-wise multiplication with the attention scores produced by the Transformer Encoders.
  • Figure 3: Qualitative Analysis. A qualitative analysis of a single instance from the MADv2-eval dataset, showing the visual and audio signals alongside the mask values for the first transformer layer. On the x-axis, video tokens occupy positions 0 to 24 and audio tokens positions 25 to 50.
  • Figure 4: Analysis of Attention Weight Distribution in the Qualitative Example. The plot shows the distribution of attention weights in the first transformer layer under two configurations: with the Learnable Attention Mask (LAM) and with full attention. Under LAM, the distribution is shifted to the left, with a significant portion of the weights at or near zero. The weights correspond to the same example as in Figure 3.
  • Figure S5: Ablation studies on the number of layers in LAM and the type of mask operation. We vary the number of layers in the Learnable Attention Mask (LAM) module, applied in the cross-attention configuration, from $2$ to $64$, and compare two element-wise methods for fusing the mask with the attention weights: multiplication and addition. The experiments are evaluated on the QVHighlights validation split with Moment-DETR. Performance improves notably, particularly in Average mAP for the Moment Retrieval task, and the largest improvements occur with $32$ layers in the LAM module.
  • ...and 3 more figures
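
The captions for Figure 2 and Figure S5 describe a mask that is generated from the entire token sequence and fused element-wise with the attention map, by multiplication or by addition. The NumPy sketch below illustrates that idea under stated assumptions and is not the authors' implementation. The two-layer mask generator, its weights `w1` and `w2`, the sigmoid output range, and the choice to fuse the mask before the softmax are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def learnable_mask(tokens, w1, w2):
    """Hypothetical mask generator: maps a whole (L, d) sequence to an (L, L) mask.

    Assumption: a two-layer feed-forward map over the flattened sequence,
    so every mask entry depends on all tokens at once.
    """
    seq_len = tokens.shape[0]
    flat = tokens.reshape(-1)                 # (L * d,)
    hidden = np.maximum(flat @ w1, 0.0)       # ReLU, shape (h,)
    logits = hidden @ w2                      # (L * L,)
    gate = 1.0 / (1.0 + np.exp(-logits))      # sigmoid keeps values in [0, 1]
    return gate.reshape(seq_len, seq_len)


def masked_attention(q, k, v, mask, fusion="mul"):
    """Scaled dot-product attention with an element-wise mask fused into the scores.

    Assumption: the mask is fused before the softmax.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (L, L)
    if fusion == "mul":
        scores = scores * mask
    elif fusion == "add":
        scores = scores + mask
    else:
        raise ValueError(f"unknown fusion: {fusion!r}")
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy usage with random data (sizes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
seq_len, dim, hidden_dim = 6, 4, 8
tokens = rng.normal(size=(seq_len, dim))
w1 = rng.normal(scale=0.1, size=(seq_len * dim, hidden_dim))
w2 = rng.normal(scale=0.1, size=(hidden_dim, seq_len * seq_len))

mask = learnable_mask(tokens, w1, w2)
out_mul, weights_mul = masked_attention(tokens, tokens, tokens, mask, fusion="mul")
out_add, weights_add = masked_attention(tokens, tokens, tokens, mask, fusion="add")
```

In a multi-layer variant, one such generator per Transformer layer would produce its own mask. In a trained model `w1` and `w2` would be learned jointly with the encoder rather than drawn at random.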