Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Wei Ye; Chaoya Jiang; Haiyang Xu; Chenhao Ye; Chenliang Li; Ming Yan; Shikun Zhang; Songhang Huang; Fei Huang

TL;DR

This work introduces TRIPS (Text-Relevant Image Patch Selection), an efficient VLP approach that progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference.

Abstract

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still suffer from computational inefficiency caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup while maintaining competitive or superior performance on downstream tasks.
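To make the idea of text-guided selection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `select_patches`, the `keep_ratio` parameter, the use of a single pooled text vector, the scaled dot-product scoring, and the score-weighted fusion of dropped patches are all assumptions chosen for illustration. The abstract only states that attentive tokens are identified with text guidance and inattentive ones are fused.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_patches(patches, text_vec, keep_ratio=0.5):
    """Keep the patches most relevant to the text and fuse the rest.

    patches   : list of patch embeddings (each a list of floats)
    text_vec  : pooled text embedding of the same dimension (assumption)
    keep_ratio: fraction of patches to keep (hypothetical parameter)
    Returns (new_patch_sequence, kept_indices).
    """
    dim = len(text_vec)
    # Text-dependent attention: scaled dot product of the text vector
    # with every patch, normalised with softmax (assumed scoring rule).
    logits = [
        sum(t * p for t, p in zip(text_vec, patch)) / math.sqrt(dim)
        for patch in patches
    ]
    scores = softmax(logits)

    n_keep = max(1, math.ceil(keep_ratio * len(patches)))
    order = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])  # preserve the original patch order
    drop_idx = order[n_keep:]

    kept = [patches[i] for i in keep_idx]
    if drop_idx:
        # Fuse inattentive patches into one token, weighted by their scores
        # (assumed fusion rule). No learnable parameters are involved.
        weight = sum(scores[i] for i in drop_idx)
        fused = [
            sum(scores[i] * patches[i][d] for i in drop_idx) / weight
            for d in range(dim)
        ]
        kept.append(fused)
    return kept, keep_idx


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    text_vec = [1.0, 0.5]
    new_seq, kept_indices = select_patches(patches, text_vec, keep_ratio=0.5)
    print(kept_indices, len(new_seq))
```

Applying such a step at more than one depth in the backbone would shorten the visual sequence progressively, which is where the savings in later self-attention layers would come from. Note that the sketch introduces no trainable weights, in line with the abstract's statement that TRIPS adds no extra parameters.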

Paper Structure (33 sections, 3 equations, 7 figures, 11 tables)

Introduction
Related Work
Vision-Language Pre-training
ViTs Acceleration
Method
Model Architecture
Text-Relevant Image Patch Selection
Extension to Generative VLP Models
Extension to Single-stream VLP Models
Pre-training Objectives
Experiment Settings
Implementation Details
Pre-training Data
Experiment Results
Overall Performance
...and 18 more sections

Figures (7)

Figure 1: Sub-figure (a) presents the VQA cases for ALBEF (Li et al., 2021) fine-tuned on the VQA task, where input image tokens are directly selected based on the attention weight of the image [CLS] token in relation to other image tokens. We visualize the attention distribution of the image [CLS] token, which, as seen, naturally concentrates on objects within the images while disregarding the backgrounds. If the question pertains to the objects in the images, accurate predictions can be obtained. Sub-figure (b) compares VQA predictions between ALBEF, which directly selects image tokens guided by the image [CLS] token, and our model, TRIPS-ALBEF. As illustrated, when questions relate to image backgrounds, the former produces incorrect answers, whereas the latter provides the right responses by preserving text-relevant image tokens.
Figure 2: Sub-figure (a) showcases the overall architecture of the VLP model (TRIPS-ALBEF) presented in this paper. Sub-figure (b) provides a visual overview of the ViT augmented with a Text-Relevant Image Patch Selection module (ViT-TRIPS). We assume the ViT-TRIPS comprises 12 layers, and the 5th and 10th layers serve as the patch selection layers. Sub-figure (c) depicts the design of the Text-Relevant Image Patch Selection layer.
Figure 3: Sub-figure (a) depicts the ViT-TRIPS backbone of the single-stream VLP model. Sub-figure (b) is the illustration of the Text-Relevant Image Patch-Selection layer.
Figure 4: Performance and inference speed of different VLP models. The y-axis shows each model's accuracy on the VQA test-dev set, and the x-axis shows its throughput in image-text pairs processed on a single 32G V100 GPU.
Figure 5: Results for three models (TRIPS-ViLT, TRIPS-ALBEF, TRIPS-mPLUG) at different image resolutions. Sub-figure (a) shows VQA results, sub-figure (b) shows throughput, and sub-figure (c) shows FLOPs.
...and 2 more figures
