LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Hashemi, Zhanghexuan Li, Mihir Chauhan, Yan Shen, Abhishek Satbhai, Mir Basheer Ali, Mingchen Gao, Sargur Srihari

TL;DR

This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. Experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.

Abstract

Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a set of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning. We also propose a new evaluation metric that measures the similarity between the learnt visual and textual embeddings. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.

Paper Structure

This paper contains 20 sections, 17 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An overview of the end-to-end LAViTeR network. The VTA module is assisted by the ITM and TIM modules and in turn learns to better align the corresponding visual and textual counterparts. The bidirectional arrows indicate the alignment between words and their respective objects in the given image. The intra-word arrows indicate the relationships between the input words that the network learns.
  • Figure 2: The architecture of the proposed LAViTeR. The pipelines with dotted outlines are the two assisting tasks, namely image-to-text and text-to-image conversion. Feature vectors of real image regions are denoted by $r$, while $v$ denotes the global image feature vector. The sentence-level feature vector of the real text is denoted by $s$, while $w$ denotes the word-level feature vectors. Similarly, $\hat{w},\hat{s},\hat{r},\hat{v}$ denote the features extracted from generated samples. $L$ stands for the various losses. Dotted arrows indicate the vectors that contribute to a loss. Solid arrows indicate the vectors that are input to the subsequent network.
  • Figure 3: A t-SNE (van der Maaten and Hinton, 2008) visualization of 3200 image representations and 32 textual-label representations from LAViTeRcocoeval.
  • Figure 4: The top-5 image-to-text matching captions with descending similarity scores. Blue captions are the correct matches, while red ones are incorrect matches.
  • Figure 5: The top-3 text-to-image matching images with descending similarity scores from left to right. Green marks are the correct matches, while red crosses are incorrect matches.
  • ...and 5 more figures