Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu

TL;DR

SOHO addresses the limitation of region-based visual features in vision-language pre-training by proposing an end-to-end framework that processes whole images using a trainable visual encoder and a dynamic Visual Dictionary (VD) to produce compact visual tokens. It introduces Masked Visual Modeling (MVM) alongside Masked Language Modeling (MLM) and Image-Text Matching (ITM) to align visual and textual modalities within a multi-layer Transformer, with VD embeddings updated via a moving-average scheme. Trained on in-domain data from MSCOCO and Visual Genome with the pre-training objectives weighted equally, SOHO achieves consistent improvements across image-text retrieval, VQA, NLVR$^2$, and SNLI-VE, and runs inference roughly 10x faster than region-based methods. The approach reduces labeling costs by removing the need for bounding-box annotations, which makes it practical for real-time, scalable vision-language applications.

Abstract

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from the paired natural language. In this paper, we propose SOHO to "See Out of tHe bOx", which takes a whole image as input and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR$^2$ test-P split, and 6.7% accuracy on the SNLI-VE test split.

Paper Structure

This paper contains 25 sections, 9 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Comparisons of SOHO and region-based methods by top-1 image-to-text retrieval (TR) and visual question answering (VQA) results. Baselines lack global context and fail to understand the image. SOHO discovers visual clues out of region boxes and infers correct human activities. [Best viewed in color.]
  • Figure 2: The framework of the proposed end-to-end pre-training model SOHO. For an input text (a), we use the text embedding operation (b) to extract the textual embedding features. For an input image (d), we propose to use a trainable CNN-based encoder (e) to extract visual representations. To further transform image features to consistent semantics, we apply a visual dictionary-based image embedding (f) to the image encoder outputs. Finally, we apply multi-layer Transformers to the output of multi-modal concatenation (c) with three pre-training tasks. Note that the index matrix in (f) is used as labels in the Masked Visual Modeling (MVM) task in (g). [Best viewed in color.]
  • Figure 3: Visualization of VD. The left and right indices reflect the semantics of "head" and "building", respectively, each with consistent visual patterns.
  • Figure 4: Visualization of the visual dictionary (VD) learned by SOHO. In addition to the two indices shown in the paper, ten more randomly selected indices from the visual dictionary are presented in this supplementary material. The results show that the visual dictionary learns to group image patches with meaningful and consistent semantics under the same index, so each index reflects an abstraction of visual semantics. [Best viewed in color.]
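
The visual dictionary step described in the TL;DR and in the Figure 2 caption (dictionary-based embedding, an index matrix reused as MVM labels, and a moving-average update) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The nearest-neighbor lookup by Euclidean distance, the momentum value of 0.99, the array sizes, and the function names `vd_lookup` and `vd_update` are all assumptions made for illustration.

```python
import numpy as np


def vd_lookup(features, dictionary):
    """Return, for each feature vector, the index of its nearest dictionary entry.

    Assumption: "nearest" means smallest squared Euclidean distance.
    """
    # Pairwise squared distances between features (n, c) and entries (k, c).
    dists = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def vd_update(features, indices, dictionary, momentum=0.99):
    """Moving-average update of the entries selected in this batch.

    Assumption: each selected entry moves toward the mean of the features
    assigned to it; entries that were not selected stay unchanged.
    """
    updated = dictionary.copy()
    for j in np.unique(indices):
        batch_mean = features[indices == j].mean(axis=0)
        updated[j] = momentum * dictionary[j] + (1.0 - momentum) * batch_mean
    return updated


rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))    # six encoder outputs (hypothetical sizes)
dictionary = rng.normal(size=(3, 4))  # three dictionary entries (hypothetical)

indices = vd_lookup(features, dictionary)  # index matrix, reusable as MVM labels
embedded = dictionary[indices]             # dictionary-based visual embeddings
new_dictionary = vd_update(features, indices, dictionary)
```

In this sketch `indices` plays the role of the index matrix from Figure 2 (f): a masked position's index is what an MVM-style objective would try to predict, while `embedded` is what would be passed on to the Transformer.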