NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao, Zhuoyue Wang, Hang Zhang, Lun Wang
TL;DR
NEVLP addresses the noisy and incomplete nature of web-crawled data in vision-language pre-training with a noise-robust framework that keeps the image encoder and the LLM frozen and bridges them with a lightweight transformer. It introduces two strategies. Noise-adaptive learning estimates a per-pair noise probability $\varepsilon_i$ and applies a corresponding regularization factor $\omega_i = \lambda \varepsilon_i$ to image-text contrastive learning. Concept-enhanced learning enriches incomplete text with visual concepts drawn from a corpus $Q$ to improve cross-modal matching and image-grounded text generation. The framework optimizes three objectives, noise-adaptive image-text contrastive learning (NITC), concept-enhanced image-grounded text generation (CITG), and concept-enhanced image-text matching (CITM), using a separate attention mask for each objective, which supports both representation learning and generation while mitigating noise. Across image captioning, image-text retrieval, and VQA, NEVLP achieves state-of-the-art performance with substantially less pre-training data (42M images).
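To make the role of $\omega_i = \lambda \varepsilon_i$ concrete, here is a minimal sketch of one way such a factor could regularize a contrastive loss. It assumes the regularization takes the form of per-pair label smoothing, which this summary does not confirm; the function name `noise_adaptive_itc`, the default `lam=0.5`, and the restriction to the image-to-text direction are all illustrative choices rather than the authors' implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def noise_adaptive_itc(sim, eps, lam=0.5):
    """Image-to-text contrastive loss with per-pair smoothed targets.

    sim[i][j]: similarity logit between image i and text j.
    eps[i]:    estimated noise probability of pair i, in [0, 1].
    lam:       hypothetical scale, so that omega_i = lam * eps_i.

    Assumption: the regularizer is label smoothing. A pair judged noisy
    keeps less target mass on its own caption and spreads the rest
    uniformly over the other captions in the batch.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs in the batch")
    total = 0.0
    for i in range(n):
        omega = lam * eps[i]
        logp = log_softmax(sim[i])
        for j in range(n):
            target = 1.0 - omega if j == i else omega / (n - 1)
            total -= target * logp[j]
    return total / n


if __name__ == "__main__":
    sim = [[5.0, 0.0], [0.0, 5.0]]
    print(noise_adaptive_itc(sim, [0.0, 0.0]))  # reduces to plain cross-entropy
    print(noise_adaptive_itc(sim, [1.0, 1.0]))  # softened targets for noisy pairs
```

With $\varepsilon_i = 0$ the targets are one-hot and the loss is the ordinary contrastive cross-entropy, so under this assumption clean pairs are trained exactly as before and only suspected noisy pairs are softened. A symmetric text-to-image term would be built the same way from the columns of `sim`.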
Abstract
The success of Vision Language Models (VLMs) on various vision-language tasks relies heavily on pre-training with large-scale web-crawled datasets. However, because web data is noisy and incomplete, performance depends strongly on dataset scale, which makes end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer, and we introduce two learning strategies to mitigate the impact of noise: noise-adaptive learning and concept-enhanced learning. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer's memorization effect and apply noise-adaptive regularization to image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text with visual concepts (objects in the image), which provide prior information about the objects present for image-text matching and image-grounded text generation and thereby compensate for missing text. Our framework makes effective use of noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.
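The abstract does not spell out how the memorization effect is turned into a noise probability. A common recipe in the noisy-label literature, used here purely as an assumption, is to fit a two-component Gaussian mixture to the per-pair training losses and read the posterior of the high-loss component as the noise probability, since networks tend to fit clean pairs before noisy ones. The sketch below implements that recipe with a small EM loop; `estimate_noise_probs` and its parameters are hypothetical names, not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def estimate_noise_probs(losses, iters=100, min_var=1e-6):
    """Posterior probability that each loss belongs to the high-loss component.

    Assumption: per-pair losses follow a two-component 1-D Gaussian mixture,
    where the low-mean component models clean pairs and the high-mean
    component models noisy pairs.
    """
    lo, hi = min(losses), max(losses)
    means = [lo, hi]
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = [0.5] * len(losses)
    for _ in range(iters):
        # E-step: responsibility of the second component for each loss.
        resp = []
        for x in losses:
            p0 = weights[0] * gaussian_pdf(x, means[0], variances[0])
            p1 = weights[1] * gaussian_pdf(x, means[1], variances[1])
            denom = p0 + p1
            resp.append(p1 / denom if denom > 0.0 else 0.5)
        # M-step: re-estimate weights, means, and variances.
        n1 = sum(resp)
        n0 = len(losses) - n1
        if n0 < 1e-12 or n1 < 1e-12:
            break  # one component collapsed; keep the last responsibilities
        means[0] = sum((1.0 - r) * x for r, x in zip(resp, losses)) / n0
        means[1] = sum(r * x for r, x in zip(resp, losses)) / n1
        variances[0] = max(
            sum((1.0 - r) * (x - means[0]) ** 2 for r, x in zip(resp, losses)) / n0,
            min_var,
        )
        variances[1] = max(
            sum(r * (x - means[1]) ** 2 for r, x in zip(resp, losses)) / n1,
            min_var,
        )
        weights = [n0 / len(losses), n1 / len(losses)]
    # Report the posterior of whichever component ended with the larger mean.
    if means[1] < means[0]:
        resp = [1.0 - r for r in resp]
    return resp


if __name__ == "__main__":
    losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9]
    for loss, eps in zip(losses, estimate_noise_probs(losses)):
        print(f"loss={loss:.2f}  eps={eps:.3f}")
```

In a real pipeline the losses would come from a warm-up pass of the contrastive objective over the training pairs, and the resulting probabilities would feed the regularization step; both of those connections are assumptions here.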
