COPA: Efficient Vision-Language Pre-training Through Collaborative Object- and Patch-Text Alignment
Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Ji Zhang, Fei Huang
TL;DR
COPA makes ViT-based vision-language pre-training more efficient by giving the model fine-grained patch-text alignment. It introduces a Patch-Text Alignment (PTA) pre-training task and a Text-aware Patch Detector (TPD): object-level cues are converted into patch-level supervision, and the detector keeps only the patches most relevant to the input text, which shortens the visual sequence and accelerates inference. The model is trained end-to-end with the joint objective L = L_ITC + L_ITM + L_MLM + L_Prefix + L_PTA, using object annotations for only 5% of the training images. This allows pre-training on 4M image-text pairs and yields a speedup of nearly 88% with competitive or superior downstream performance. The shorter visual sequence also enables higher-resolution fine-tuning, and COPA reports strong results on VQA, captioning, retrieval, and grounding without relying on a heavy object detector. The method can also be extended to single-stream architectures while preserving the speedup.
Abstract
Vision-Language Pre-training (VLP) methods based on object detection benefit from rich, fine-grained object-text alignment, but at the cost of computationally expensive inference. Recent Vision Transformer (ViT)-based approaches avoid this cost, yet they struggle with long visual sequences and lack detailed cross-modal alignment information. This paper introduces a ViT-based VLP technique that efficiently incorporates object information through a novel patch-text alignment mechanism. Specifically, we convert object-level signals into patch-level ones and devise a Patch-Text Alignment (PTA) pre-training task to learn a text-aware patch detector. Using off-the-shelf detailed object annotations for 5% of the training images, we jointly train PTA with other conventional VLP objectives in an end-to-end manner. This bypasses the high computational cost of object detection and yields an effective patch detector that accurately detects text-relevant patches, which considerably shortens the patch sequence and accelerates computation within the ViT backbone. Our experiments on a variety of widely used benchmarks show that our method achieves a speedup of nearly 88% over prior VLP models while maintaining competitive or superior performance on downstream tasks with similar model size and data scale.
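The two ideas in the abstract, turning object boxes into patch-level labels and keeping only the highest-scoring patches, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the overlap threshold `min_overlap`, and the fixed `keep_ratio` are hypothetical choices made for illustration, and the per-patch scores are assumed to come from some text-aware patch detector.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; an assumed box format.
Box = Tuple[float, float, float, float]


def boxes_to_patch_labels(boxes: List[Box], image_size: int, patch_size: int,
                          min_overlap: float = 0.5) -> List[int]:
    """Return one 0/1 label per patch in row-major order.

    A patch is labeled 1 when some box covers at least `min_overlap`
    of the patch area. The threshold rule is an assumption of this sketch.
    """
    grid = image_size // patch_size
    patch_area = patch_size * patch_size
    labels = []
    for row in range(grid):
        for col in range(grid):
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            label = 0
            for x0, y0, x1, y1 in boxes:
                # Width and height of the patch/box intersection.
                w = max(0.0, min(px1, x1) - max(px0, x0))
                h = max(0.0, min(py1, y1) - max(py0, y0))
                if w * h >= min_overlap * patch_area:
                    label = 1
                    break
            labels.append(label)
    return labels


def keep_top_patches(scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring patches, in original order."""
    k = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# A 64x64 image with 16x16 patches gives a 4x4 grid of 16 patches.
labels = boxes_to_patch_labels([(16, 16, 48, 48)], image_size=64, patch_size=16)
kept = keep_top_patches([0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
```

In this sketch the labels would serve as binary targets for a patch-level alignment loss, and the kept indices would select which patch tokens are passed on to later ViT layers, so the visual sequence shrinks in proportion to `keep_ratio`.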
