Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge

Pierre Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, Brian Belgodere

TL;DR

This work details the theory and engineering behind the winning submission to the VizWiz 2020 image captioning competition, and provides a step towards improved assistive image captioning systems.

Abstract

Image captioning has recently demonstrated impressive progress, largely owing to the introduction of neural network algorithms trained on curated datasets such as MS-COCO. Work in this field is often motivated by the promise of deploying captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets limits the utility of systems trained on them as assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that carry useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering behind our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.

Paper Structure

This paper contains 21 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison of images in VizWiz (goal or task oriented captions) and COCO dataset (generic captions). For VizWiz images, along with the caption, we note any quality issues that are present in the image, such as blur, need for rotation, and occlusions.
  • Figure 2: Multimodal Assistive Captioner Overview. An input image is passed through 3 separate modules to perform image feature extraction, OCR and angle correction, and object detection. Their outputs are combined and submitted to a Multimodal Transformer Network. The generated caption is further biased towards the OCR and object detection outputs by means of copy mechanisms to produce more goal-oriented content, rather than a generic descriptive caption.
  • Figure 3: Multimodal Assistive Captioner. Our captioner is a multimodal Transformer with an encoder/decoder architecture composed of 4 distinct stages. First, an Encoding stage provides embeddings for the 3 modality streams of image features, object detection, and OCR results derived from an input image. Image features are extracted by a ResNeXt model, while the object detection and OCR modules generate sequences of object categories (10 max) and strings of characters (20 max) respectively, which are then embedded using fastText. Embeddings for these 3 modalities are combined and fed to a Decoding stage, where a decoder generates a caption whose content is biased towards the OCR and object detection results by a copy mechanism that influences the Loss computation stage. The Post Processing stage finalizes the generated caption by cleaning it up, maximizing the use of OCR tokens, and promoting coherent candidates (via a self-BERT score). The OCR module is expanded to show the inner workings of our Dictionary-Guided Rotation-Invariant OCR module. Rotating the input image yields 4 sets of OCR results, which enable the selection of a proper rotation based on the largest number of intelligible tokens.
  • Figure 4: OCR example with two competing rotations. In this case, OCR detects tokens for two rotations, and after dictionary-guided ranking the tokens associated with $90^{\circ}$ rotation are selected.
  • Figure 5: OCR example with four competing rotations. In this case, OCR detects tokens for all four rotations, and after dictionary-guided ranking the tokens associated with $0^{\circ}$ rotation are selected.
  • ...and 5 more figures
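
The rotation selection described in the captions of Figures 3 to 5 can be illustrated with a minimal sketch. It assumes the OCR engine has already been run on the four rotated copies of the image and that "intelligible" means a case-insensitive match against a word list. The names `count_intelligible`, `select_rotation`, and `vocabulary`, the sample tokens, and the tie-breaking rule are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def count_intelligible(tokens: List[str], vocabulary: Set[str]) -> int:
    """Count OCR tokens that appear in the dictionary (case-insensitive)."""
    return sum(1 for token in tokens if token.lower() in vocabulary)


def select_rotation(ocr_by_rotation: Dict[int, List[str]], vocabulary: Set[str]) -> int:
    """Return the rotation angle whose OCR output has the most dictionary words.

    Assumption: ties are broken in favor of the smallest angle, because
    max() keeps the first maximal element of the sorted angle list.
    """
    return max(
        sorted(ocr_by_rotation),
        key=lambda angle: count_intelligible(ocr_by_rotation[angle], vocabulary),
    )


# Hypothetical OCR output for the four rotations of a single image.
vocabulary = {"chicken", "noodle", "soup", "net", "weight"}
ocr_by_rotation = {
    0: ["dnos", "elpoon"],
    90: ["Chicken", "Noodle", "Soup"],
    180: [],
    270: ["net", "xq"],
}

best_angle = select_rotation(ocr_by_rotation, vocabulary)
best_tokens = ocr_by_rotation[best_angle]
```

Here the 90 degree rotation yields three dictionary words, more than any other rotation, so its tokens would be the ones passed on to the captioner. A real system would likely need a larger dictionary and some tolerance for OCR misspellings, which this sketch leaves out.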