Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

Yang Yang

TL;DR

A novel Semi-Supervised Image Captioning method considering Wasserstein Graph Matching (SSIC-WGM), which uses the raw image inputs to supervise the generated sentences and constrains them from two aspects: inter-modal consistency and intra-modal consistency.

Abstract

Image captioning automatically generates captions for given images, and the key challenge is to learn a mapping function from visual features to natural language features. Existing approaches are mostly supervised, i.e., each image in the training set has a corresponding sentence. However, because describing images requires a great deal of manual effort, real-world applications usually offer only a limited number of described images (i.e., image-text pairs) alongside a large number of undescribed images. This gives rise to the problem of "Semi-Supervised Image Captioning". To solve this problem, we propose a novel Semi-Supervised Image Captioning method considering Wasserstein Graph Matching (SSIC-WGM), which uses the raw image inputs to supervise the generated sentences. Unlike traditional single-modal semi-supervised methods, the difficulty of semi-supervised cross-modal learning lies in constructing intermediate information that is comparable across heterogeneous modalities. In this paper, SSIC-WGM adopts scene graphs as the intermediate information and constrains the generated sentences from two aspects: 1) Inter-modal consistency. SSIC-WGM constructs the scene graphs of the raw image and the generated sentence respectively, then employs the Wasserstein distance to better measure the similarity between region embeddings of the different graphs. 2) Intra-modal consistency. SSIC-WGM applies data augmentation techniques to the raw images, then constrains the consistency among augmented images and generated sentences. Consequently, SSIC-WGM combines cross-modal pseudo supervision and a structure-invariant measure to use the undescribed images efficiently, and learns a more reasonable mapping function.
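
To make the inter-modal consistency term more concrete, the following is a minimal sketch of how a Wasserstein-style distance between two sets of scene-graph node embeddings could be computed. It is not the authors' implementation: the abstract does not specify the solver, the cost function, or the node weights, so the sketch assumes an entropy-regularized (Sinkhorn) approximation, a cosine-distance cost, and uniform node weights. The function name `sinkhorn_wasserstein` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_wasserstein(x, y, eps=0.1, n_iters=200):
    """Approximate optimal-transport cost between two sets of embeddings.

    x: (n, d) array, e.g. node embeddings of an image scene graph.
    y: (m, d) array, e.g. node embeddings of a sentence scene graph.
    Assumptions (not from the paper): cosine-distance cost, uniform
    node weights, Sinkhorn iterations with regularization eps.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.shape[0], y.shape[0]

    # Cosine distance between every pair of nodes; values lie in [0, 2].
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    yn = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - xn @ yn.T

    # Uniform mass on the nodes of each graph.
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)

    # Sinkhorn scaling: alternately match row and column marginals.
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    # Transport plan and the resulting transport cost.
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost)), plan


# Toy usage: three "image" nodes compared with two candidate "sentence" graphs.
image_nodes = np.eye(3)
matching_sentence_nodes = np.eye(3)
unrelated_sentence_nodes = np.ones((3, 3))

d_match, _ = sinkhorn_wasserstein(image_nodes, matching_sentence_nodes)
d_other, _ = sinkhorn_wasserstein(image_nodes, unrelated_sentence_nodes)
print(d_match < d_other)
```

In a training setting, a distance of this kind would be written in a differentiable framework so that it can act as a loss on the generated sentence; the NumPy version here only illustrates the computation.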

Paper Structure (24 sections, 9 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Diagram of the proposed SSIC-WGM. Top: the encoder-decoder model with semi-supervised loss. Bottom: our unsupervised loss, which qualifies the generated sentences of undescribed images with intra-modal and inter-modal consistencies. In detail, two weakly-augmented images (i.e., ${\bf v}_1,{\bf v}_2$) and the raw image (i.e., ${\bf v}_0$) are fed into the encoder-decoder model to obtain the corresponding sentences (i.e., ${\bf w}_0,{\bf w}_1,{\bf w}_2$). Inter-modal consistency calculates the distance between the raw image's scene graph and the generated sentence's scene graph, while intra-modal consistency constrains the distances between the scene graphs of any two generated sentences or any two images.
  • Figure 2: Effect analyses of Inter-Modal Consistency and Intra-Modal Consistency with Cross-Entropy Loss.
  • Figure 3: Captioning performance under different ratios of supervised data. XE denotes the results with cross-entropy loss, and RL denotes the results with CIDEr-D Score Optimization.
  • Figure 4: Captioning performance under different ratios of unsupervised data (CIDEr-D Score Optimization).
  • Figure 5: Examples of captions generated by SSIC-WGM and baseline models as well as the corresponding ground truths.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3