COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment

Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Ji Zhang, Fei Huang

TL;DR

COPA makes vision-language pre-training more efficient by giving ViT-based models fine-grained patch-text alignment. It introduces a Patch-Text Alignment (PTA) task and a Text-aware Patch Detector (TPD): PTA converts object-level annotations into patch-level supervision, and TPD keeps only the patches most relevant to the input text, which shortens the visual sequence and accelerates inference. The model is trained end-to-end with the joint objective L = L_ITC + L_ITM + L_MLM + L_Prefix + L_PTA, using object annotations for only 5% of the training images. Pre-trained on 4M image-text pairs, it achieves a speedup of nearly 88% with competitive or superior downstream performance. The shorter visual sequence also permits higher-resolution fine-tuning, and COPA reports strong results on VQA, captioning, retrieval, and grounding without relying on a heavy object detector. The method also shows potential for extension to single-stream architectures while preserving the speedup.

Abstract

Vision-Language Pre-training (VLP) methods based on object detection benefit from fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches avoid this cost, yet they struggle with long visual sequences and lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment (PTA) pre-training task to learn a text-aware patch detector. Using off-the-shelf fine-grained object annotations for 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner. This bypasses the high computational cost of object detection and yields an effective patch detector that accurately detects text-relevant patches, considerably shortening patch sequences and accelerating computation within the ViT backbone. Our experiments on a variety of widely used benchmarks show that our method achieves a speedup of nearly 88% compared to prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.
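The conversion of object-level signals into patch-level ones can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boxes_to_patch_labels`, the box format `(x0, y0, x1, y1)` in pixels, and the `min_overlap` rule (a patch counts as positive when at least that fraction of its area lies inside some box) are all assumptions made for illustration, since the abstract does not state the exact labeling rule.

```python
def boxes_to_patch_labels(boxes, image_size=256, patch_size=16, min_overlap=0.5):
    """Turn object boxes into one 0/1 label per image patch (row-major order).

    Hypothetical helper: the overlap threshold is an assumed rule,
    not taken from the paper.
    """
    grid = image_size // patch_size          # patches per side
    labels = [0] * (grid * grid)
    patch_area = patch_size * patch_size
    for x0, y0, x1, y1 in boxes:
        for row in range(grid):
            for col in range(grid):
                # Pixel extent of this patch.
                px0, py0 = col * patch_size, row * patch_size
                px1, py1 = px0 + patch_size, py0 + patch_size
                # Width and height of the box/patch intersection.
                inter_w = max(0, min(x1, px1) - max(x0, px0))
                inter_h = max(0, min(y1, py1) - max(y0, py0))
                if inter_w * inter_h / patch_area >= min_overlap:
                    labels[row * grid + col] = 1
    return labels


# A 64x64 image with 16x16 patches gives a 4x4 grid; a box covering the
# top-left quadrant marks the four patches it fully contains.
labels = boxes_to_patch_labels([(0, 0, 32, 32)], image_size=64, patch_size=16)
positive_indices = [i for i, v in enumerate(labels) if v == 1]
```

Labels of this kind could then serve as binary targets for a patch detector; the form of the PTA loss itself is not specified in the text above, so it is left out of the sketch.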

Paper Structure (34 sections, 6 equations, 5 figures, 11 tables, 1 algorithm)

Figures (5)

  • Figure 1: Subfigure (a) illustrates the effect of the Text-aware Patch Detector (TPD) in the VQA scenario at various keeping ratios; the keeping ratio is a hyperparameter that sets the proportion of visual tokens retained out of all visual tokens. Subfigure (b) demonstrates how Patch-Text Alignment converts object-level annotations to patch-level annotations and optimizes TPD based on the obtained supervision signals. Subfigure (c) presents the VQA accuracy and throughput results for our VLP model and the baseline.
  • Figure 2: (a) Overview of our VLP model (COPA). By incorporating the PTA task, we can learn the fine-grained patch-text alignment end-to-end through joint optimization with other pre-training tasks. (b) The architecture of the Text-aware Patch Detector (TPD), which is plugged into the ViT-base visual backbone (ViT-TPD). In this sub-figure, we give a simplified example to show how to detect text-relevant patches and calculate the PTA loss.
  • Figure 3: Single-GPU memory cost of pre-training COPA with different keeping ratios and TPD detection locations. We set the batch size to 512, the image size to 256, and the text length to 25. The red line in sub-figure (a) is the GPU memory cost of the baseline model mPLUG [li2022mplug].
  • Figure 4: The visualization of the VQA case and the detected text-relevant image patches. We set the detection location to 6.
  • Figure 5: The visualization of accuracy and recall of TPD on the 10K test dataset randomly sampled from CC [cc].
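
The keeping ratio mentioned in the Figure 1 and Figure 3 captions can be made concrete with a small sketch of text-guided patch selection. It assumes each patch already has a relevance score from a detector, and it simply drops the lowest-scoring patches; the function name `select_patches`, the rounding rule, and the decision to discard rather than merge the remaining patches are illustrative assumptions, not details confirmed by the text above.

```python
import math


def select_patches(scores, keeping_ratio):
    """Return indices of the top-scoring patches, in original order.

    Hypothetical helper: keeps ceil(keeping_ratio * N) patches, at least one.
    """
    if not 0.0 < keeping_ratio <= 1.0:
        raise ValueError("keeping_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(keeping_ratio * len(scores)))
    # Rank patch indices by score, highest first, and keep the best ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so the retained sequence stays row-major.
    return sorted(ranked[:num_keep])


# Four patches with a keeping ratio of 0.5: the two highest-scoring survive.
kept = select_patches([0.1, 0.9, 0.3, 0.8], keeping_ratio=0.5)
```

As a rough illustration, a 256-pixel image with an assumed patch size of 16 yields 256 patches, so a keeping ratio of 0.5 would pass 128 patch tokens to the later ViT layers.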