BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

TL;DR

This paper proposes BridgeTower, which introduces multiple bridge layers that connect the top layers of the uni-modal encoders to each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations from different semantic levels of the pre-trained uni-modal encoders.

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
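
The bridging idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the add-then-normalize form of `bridge`, the one-to-one wiring in `bridge_wiring`, and both function names are assumptions made for this example. The default sizes (12 uni-modal layers, 6 cross-modal layers) follow the Figure 2 caption.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bridge(cross_state, uni_state):
    """Hypothetical bridge layer: element-wise add, then layer-normalize.

    cross_state: output of the previous cross-modal layer (one token vector).
    uni_state:   output of the matching uni-modal layer (same width).
    """
    return layer_norm([c + u for c, u in zip(cross_state, uni_state)])


def bridge_wiring(num_uni_layers=12, num_cross_layers=6):
    """Map each cross-modal layer (1-based) to the uni-modal layer feeding it.

    Assumes a one-to-one wiring of the top uni-modal layers, in order.
    """
    first = num_uni_layers - num_cross_layers + 1
    return {c: first + c - 1 for c in range(1, num_cross_layers + 1)}


if __name__ == "__main__":
    print(bridge_wiring())
    print(bridge([1.0, 2.0], [3.0, 4.0]))
```

Under these assumptions, each cross-modal layer receives a uni-modal representation from a different depth rather than only the last-layer output. That is the contrast the abstract draws with models that feed only last-layer representations into the cross-modal encoder.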

Paper Structure

This paper contains 38 sections, 6 equations, 5 figures, and 14 tables.

Figures (5)

  • Figure 1: (a)-(d) are four categories of current Two-Tower vision-language models; (e) gives a brief illustration of the BridgeTower architecture. VE, TE, and CE are short for the Visual Encoder, Textual Encoder, and Cross-modal Encoder, respectively. The height of each rectangle represents its relative computational cost. $\text{VE}=\text{TE}$ indicates that the visual encoder and the textual encoder have the same or a similar number of parameters or computational costs. Illustration inspired by ViLT.
  • Figure 2: Illustration of BridgeTower. BridgeTower consists of a $12$-layer textual encoder, a $12$-layer visual encoder, and a $6$-layer cross-modal encoder. Each of the top $6$ layers of the visual and textual encoders is connected to each layer of the cross-modal encoder via bridge layers, which brings bottom-up alignment and fusion to the cross-modal encoder.
  • Figure 3: The KL divergence between the attention distributions of different heads (small dots) and the averaged KL divergence (large dots) in each layer, plotted against the layer number, for the self-/cross-attention of the visual/textual part of the cross-modal encoder in the METER and BridgeTower models.
  • Figure 4: (a) & (b) give brief illustrations of the BridgeTower and METER architectures; (c) gives a brief illustration of a $3$-layer cross-modal encoder with $2$ internal cross-modal layers and $1$ external cross-modal layer; (d) gives a brief illustration of how METER explores simple multi-layer feature fusion via a weighted sum of the uni-modal representations of all layers.
  • Figure 5: Visualization of the cross-attention maps of BridgeTower and METER. The example comes from the VQAv2 validation set. Predictions come from the fine-tuned checkpoints of both models.
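
The head-diversity measurement described in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: the caption does not specify how pairs of heads are chosen or averaged, so the all-ordered-pairs average in `mean_pairwise_kl` and both function names are assumptions made for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_pairwise_kl(heads):
    """Average KL divergence over all ordered pairs of distinct heads.

    heads: list of attention distributions, one per head, each summing to 1.
    """
    pairs = [
        (i, j)
        for i in range(len(heads))
        for j in range(len(heads))
        if i != j
    ]
    return sum(kl_divergence(heads[i], heads[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy attention rows for three heads over four keys (made-up values).
    heads = [
        [0.70, 0.10, 0.10, 0.10],
        [0.25, 0.25, 0.25, 0.25],
        [0.10, 0.10, 0.10, 0.70],
    ]
    print(mean_pairwise_kl(heads))
```

A larger average means the heads in a layer attend to more distinct patterns, while a value near zero means they behave almost identically.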