TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

TL;DR

The paper tackles data inefficiency in Vision-Language Pre-training (VLP) caused by noisy web image-text pairs by introducing TiMix, a text-aware image mixing strategy. TiMix blends images guided by patch-text relevance via a Patch Text Alignment task and a Text-aware Patch Predictor, producing mixed samples for cross-modal contrastive learning and deriving soft labels for the text blocks. The authors provide a mutual information perspective showing that mixed samples act as a regularizer for the InfoNCE objective, and they empirically validate TiMix with ALBEF-TiMix and mPLUG-TiMix across VQA, NLVR2, image-text retrieval, image captioning, and visual grounding, achieving data-efficient gains with modest computational overhead. The approach improves cross-modal alignment and accelerates convergence, enabling competitive performance with substantially less pre-training data and time, which broadens practical deployment of VLP models.
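
The mixing and soft-label idea summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `text_aware_mix` and `soft_label_info_nce`, the top-k selection rule, the `keep_ratio` parameter, and the temperature value are all hypothetical, and the relevance scores are taken as given rather than produced by a trained Text-aware Patch Predictor.

```python
import math


def text_aware_mix(patches_a, patches_b, relevance_a, keep_ratio):
    """Keep the patches of image A most relevant to text A; fill the rest from image B.

    Assumption: relevance_a[i] is a patch-text relevance score for patch i of image A.
    Returns the mixed patch list and lam, the fraction of patches taken from A.
    """
    n = len(patches_a)
    k = max(1, min(n, round(keep_ratio * n)))
    # Indices of the k highest-relevance patches of image A.
    order = sorted(range(n), key=lambda i: relevance_a[i], reverse=True)
    keep = set(order[:k])
    mixed = [patches_a[i] if i in keep else patches_b[i] for i in range(n)]
    return mixed, k / n


def soft_label_info_nce(sims, soft_targets, temperature=0.07):
    """Cross-entropy between soft targets and softmax(sims / temperature).

    With a one-hot target this reduces to the usual InfoNCE loss for one anchor.
    """
    logits = [s / temperature for s in sims]
    top = max(logits)
    # Numerically stable log-sum-exp.
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(q * (l - log_z) for q, l in zip(soft_targets, logits))


# Toy example: 4 patches per image, strings standing in for patch embeddings.
mixed, lam = text_aware_mix(
    ["a0", "a1", "a2", "a3"],
    ["b0", "b1", "b2", "b3"],
    relevance_a=[0.9, 0.1, 0.8, 0.2],
    keep_ratio=0.5,
)
# Similarities of the mixed image to [text A, text B, unrelated text].
loss = soft_label_info_nce([0.5, 0.5, 0.0], [lam, 1.0 - lam, 0.0], temperature=1.0)
```

In this sketch the mixed image is labeled with weight `lam` on text A and `1 - lam` on text B, so both captions act as partial positives for the image-to-text contrastive term while the unrelated caption remains a negative.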

Abstract

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noise in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios.

Paper Structure (33 sections, 22 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Subfigure (a) illustrates the Visual Question Answering results and pre-training time per epoch of the VLP model mPLUG li2022mplug, pre-trained with different data sizes on 8 $\times$ 80G A100 GPUs. 4M+TiMix refers to training on 4M data with TiMix. Subfigure (b) illustrates the convergence curve of cross-modal contrastive learning; the x-axis is the epoch.
  • Figure 2: Subfigure (a) illustrates the process of TiMix, where two image-text pairs are utilized. Subfigure (b) depicts the architecture of the Text-aware Patch Predictor.
  • Figure 3: An example of TiMix in image-to-text contrastive learning. The text within the green box represents the positive samples, and the text within the gray box represents the negative samples.
  • Figure 4: VQA accuracy and pre-training time per epoch of different models pre-trained on different data sizes.
  • Figure 5: The visualization of Accuracy and Recall of TPP on the 10K test dataset randomly sampled from CC sharma2018conceptual.
  • ...and 4 more figures