NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training

Yiyi Tao, Zhuoyue Wang, Hang Zhang, Lun Wang

TL;DR

NEVLP addresses the challenge of noisy and incomplete web-crawled data in vision-language pre-training by proposing a noise-robust framework that uses frozen image encoders and LLMs bridged through a lightweight transformer. It introduces two key strategies. Noise-adaptive learning estimates a per-pair noise probability $\varepsilon_i$ and applies a corresponding regularization factor $\omega_i = \lambda \varepsilon_i$ to image-text contrastive learning. Concept-enhanced learning enriches incomplete text with visual concepts from a corpus $Q$ to improve cross-modal matching and image-grounded text generation. The framework optimizes three objectives (NITC, CITG, and CITM) via modality-specific attention masks, enabling efficient representation and generation while mitigating noise. Across image captioning, image-text retrieval, and VQA, NEVLP achieves state-of-the-art performance with substantially less pre-training data (42M images), indicating that cross-modal learning can be both noise-robust and data-efficient.

Abstract

The success of Vision Language Models (VLMs) on various vision-language tasks relies heavily on pre-training with large-scale web-crawled datasets. However, the noisy and incomplete nature of web data makes dataset scale crucial for performance, rendering end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer and introduce two learning strategies, noise-adaptive learning and concept-enhanced learning, to mitigate the impact of noise. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer's memorization effect and employ noise-adaptive regularization on image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text by incorporating visual concepts (objects in the image) to provide prior information about existing objects for image-text matching and image-grounded text generation, thereby mitigating text incompleteness. Our framework effectively utilizes noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.
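
To make the noise-adaptive regularization concrete, the sketch below shows one plausible reading of it: per-pair label smoothing on an image-to-text contrastive loss, where the smoothing rate for pair i is the TL;DR's $\omega_i = \lambda \varepsilon_i$. The label-smoothing form, the function names, and the default value of `lam` are assumptions made for illustration; the summary does not state how the factor enters the loss, and the noise probabilities `eps` are taken as given rather than estimated.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def noise_adaptive_itc_loss(sim, eps, lam=0.5):
    """Image-to-text contrastive loss with per-pair label smoothing.

    sim[i][j]: similarity logit between image i and text j
               (matched pairs sit on the diagonal).
    eps[i]:    estimated noise probability of pair i, in [0, 1].
    lam:       hypothetical scale, so that omega_i = lam * eps[i].
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        omega = lam * eps[i]
        logp = log_softmax(sim[i])
        # Assumed smoothed target: 1 - omega on the matched text,
        # omega spread evenly over the other texts in the batch.
        off_target = omega / (n - 1) if n > 1 else 0.0
        for j in range(n):
            target = (1.0 - omega) if j == i else off_target
            total -= target * logp[j]
    return total / n


if __name__ == "__main__":
    sim = [[2.0, 0.0], [0.0, 2.0]]
    print(noise_adaptive_itc_loss(sim, eps=[0.0, 0.0]))  # clean pairs
    print(noise_adaptive_itc_loss(sim, eps=[1.0, 1.0]))  # fully noisy pairs
```

With `eps` set to zero the loss reduces to ordinary cross-entropy over the batch, so clean pairs are trained as usual, while pairs flagged as likely noisy receive softer targets and pull the alignment less strongly. A text-to-image term could be added symmetrically by applying the same function to the transposed similarity matrix.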

Paper Structure (12 sections, 4 equations, 1 figure, 4 tables)

Figures (1)

  • Figure 1: (Left) Model architecture of NEVLP and the first-stage vision-language representation learning objectives. The framework consists of an image transformer for visual feature extraction and a text transformer that functions as both a text encoder and a text decoder. The image transformer and the text transformer share the same self-attention layers. We jointly optimize three objectives: concept-enhanced image-text matching, concept-enhanced image-grounded text generation, and noise-adaptive image-text contrastive learning. (Right) The self-attention masking strategy for each objective to control query-text interaction.
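
The per-objective masking mentioned in the Figure 1 caption can be illustrated with a small sketch. The caption does not spell out the masks, so the three patterns below are an assumption modeled on the common query-transformer design: fully bidirectional attention for matching, a causal mask on text for generation, and a unimodal mask for contrastive learning. The function name, mode names, and the mapping from objective to mode are all hypothetical.

```python
def build_attention_mask(n_query, n_text, mode):
    """Return an (n_query + n_text) square boolean mask.

    mask[r][c] is True when position r may attend to position c.
    Positions 0..n_query-1 are query tokens; the rest are text tokens.

    Assumed mapping (not confirmed by the source):
      "bidirectional" -> image-text matching
      "causal"        -> image-grounded text generation
      "unimodal"      -> image-text contrastive learning
    """
    if mode not in ("bidirectional", "causal", "unimodal"):
        raise ValueError(f"unknown mode: {mode}")
    n = n_query + n_text
    mask = [[False] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            r_is_query = r < n_query
            c_is_query = c < n_query
            if mode == "bidirectional":
                # Queries and text all see each other.
                allowed = True
            elif mode == "unimodal":
                # Queries see only queries; text sees only text.
                allowed = r_is_query == c_is_query
            else:  # causal
                if r_is_query:
                    # Queries never look at text.
                    allowed = c_is_query
                else:
                    # Text sees all queries and earlier (or same) text.
                    allowed = c_is_query or c <= r
            mask[r][c] = allowed
    return mask


if __name__ == "__main__":
    for mode in ("bidirectional", "causal", "unimodal"):
        print(mode)
        for row in build_attention_mask(2, 3, mode):
            print("".join("1" if v else "0" for v in row))
```

Because only the mask changes between objectives, a single set of shared self-attention weights can serve all three losses, which is consistent with the caption's statement that the two transformers share self-attention.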