Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai

TL;DR

This paper introduces NAS, a vision-language pretraining model that uses Negative Augmented Samples to improve fine-grained understanding.

Abstract

Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in how different modalities express features, and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing on coarse-grained general representations and thus failing to capture the nuanced differences needed for tasks that demand more detailed perception. In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined vision-language pretraining model that incorporates negative augmented samples to address the challenge of fine-grained understanding. NAS uses a Visual Dictionary (VD) as a semantic bridge between the visual and linguistic domains. Additionally, it employs a Negative Visual Augmentation (NVA) method based on the VD to generate challenging negative image samples. These samples differ from positive samples only at the token level, requiring the model to discern the subtle disparities between positive and negative samples with greater precision. Comprehensive experiments validate the efficacy of the NAS components and underscore its potential to enhance fine-grained vision-language comprehension.

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Fine-grained enhanced VLP architecture. (a) NTA constructs hard negative text samples for the language modality; (b) we discretize the visual representation and construct hard negative image samples for the visual modality; (c) FGITM is proposed to leverage the fine-grained negative image and text samples to enhance fine-grained capability.
  • Figure 2: Comparison of our fine-grained NAS to other VL frameworks. (a) Mainstream VLP methods use two "Dual Tower" encoders and a multi-modal encoder for deep fusion of multi-modal features (e.g., ALBEF [albef] and METER [dou2022empirical]); (b) NTA-based methods construct augmented negative text samples to enhance the VLP model's fine-grained ability with FGITM (e.g., VL-Match [bi2023vl] and ViLTA [wang2023vilta]); (c) our NAS introduces an NVA module to construct augmented negative image features, together with NTVA, to enhance the VLP fine-grained capability with FGITM in an end-to-end manner.
  • Figure 3: (a) The framework of the proposed end-to-end pretraining model NAS. (b) Illustration of our NVA. The continuous visual embedding produced by the image encoder is first quantized into discrete embeddings; the object in the image embedding is then identified based on the similarity between the global [CLS] embedding and the local discrete embeddings. We use the object embedding to search for its top-k neighbors in the dictionary and replace the object tokens with the neighbor tokens to construct negative image samples. [Best viewed in color.]
  • Figure 4: Cases on the VALSE benchmark. The first, second, and third rows show results on the existence, counting, and actions (actant swap) tests, respectively. More examples are given in the Supplementary.
  • Figure 5: Examples on the VALSE benchmark. From top to bottom, the panels display results for the existence test, counting test, and actions (actant swap) test, respectively.
  • ...and 1 more figures
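
The NVA procedure outlined in the Figure 3 caption (quantize patch embeddings against the visual dictionary, pick object tokens by similarity to [CLS], swap them for a top-k dictionary neighbor) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the squared-Euclidean distance used for the dictionary lookup, the cosine ranking against [CLS], the `num_object_tokens` parameter, and the random choice among the top-k neighbors are all assumptions made for illustration.

```python
import math
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quantize(patches, dictionary):
    """Map each continuous patch embedding to the index of its nearest dictionary entry."""
    return [
        min(range(len(dictionary)), key=lambda j: sq_dist(p, dictionary[j]))
        for p in patches
    ]


def top_k_neighbors(index, dictionary, k):
    """Indices of the k dictionary entries closest to entry `index`, excluding itself."""
    others = [j for j in range(len(dictionary)) if j != index]
    others.sort(key=lambda j: sq_dist(dictionary[index], dictionary[j]))
    return others[:k]


def negative_visual_augmentation(cls_emb, patches, dictionary,
                                 num_object_tokens=1, k=2, rng=None):
    """Return (positive_codes, negative_codes) as lists of dictionary indices.

    Assumption: the positions whose discrete embedding is most similar to the
    [CLS] embedding are treated as "object" tokens and are the ones replaced.
    """
    rng = rng or random.Random(0)
    codes = quantize(patches, dictionary)
    ranked = sorted(
        range(len(codes)),
        key=lambda i: cosine(cls_emb, dictionary[codes[i]]),
        reverse=True,
    )
    negative = list(codes)
    for pos in ranked[:num_object_tokens]:
        # Swap the object token for one of its nearest dictionary neighbors,
        # so the negative differs from the positive only at the token level.
        negative[pos] = rng.choice(top_k_neighbors(codes[pos], dictionary, k))
    return codes, negative


# Toy data (hypothetical): a 4-entry dictionary and 3 patch embeddings in 2-D.
dictionary = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
patches = [[0.95, 0.0], [0.0, 0.95], [0.05, 1.0]]
cls_emb = [1.0, 0.2]

codes, negative = negative_visual_augmentation(cls_emb, patches, dictionary)
print("positive tokens:", codes)
print("negative tokens:", negative)
```

In this toy run only the first patch, the one most aligned with [CLS], is replaced, so the negative sample differs from the positive one at a single token. A real model would operate on batched tensors and learn the dictionary during pretraining.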