VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao

TL;DR

Experimental results show that VHASR effectively uses key information in images to enhance speech recognition: it surpasses unimodal ASR and achieves state-of-the-art (SOTA) results among existing image-based multimodal ASR models.

Abstract

Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information into the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. Our system uses a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model's speech recognition ability. Its performance not only surpasses unimodal ASR but also achieves state-of-the-art (SOTA) results among existing image-based multimodal ASR models.
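
The abstract describes a dual-stream design in which each stream produces its own transcription and the outputs are then combined. This page does not give the merging rule, so the following is only a minimal sketch of one generic way to combine two streams: linear interpolation of per-step token probabilities followed by greedy decoding. The function names, the `vh_weight` parameter, and the toy distributions are all assumptions made for illustration, not the authors' method.

```python
def merge_streams(asr_probs, vh_probs, vh_weight=0.3):
    """Linearly interpolate per-step token distributions from two streams.

    asr_probs, vh_probs: lists (one entry per decoding step) of
    {token: probability} dicts. vh_weight is a hypothetical mixing weight.
    """
    if len(asr_probs) != len(vh_probs):
        raise ValueError("both streams must cover the same number of steps")
    merged = []
    for p_asr, p_vh in zip(asr_probs, vh_probs):
        tokens = set(p_asr) | set(p_vh)
        merged.append({
            t: (1.0 - vh_weight) * p_asr.get(t, 0.0) + vh_weight * p_vh.get(t, 0.0)
            for t in tokens
        })
    return merged


def greedy_decode(step_probs):
    """Pick the most probable token at every step."""
    return [max(dist, key=dist.get) for dist in step_probs]


if __name__ == "__main__":
    # Toy example: the audio-only stream slightly prefers "dock",
    # while the image-informed stream strongly prefers "dog".
    asr = [{"a": 0.9, "an": 0.1}, {"dock": 0.55, "dog": 0.45}]
    vh = [{"a": 0.8, "an": 0.2}, {"dock": 0.10, "dog": 0.90}]
    print(greedy_decode(asr))                     # audio-only hypothesis
    print(greedy_decode(merge_streams(asr, vh)))  # merged hypothesis
```

In this toy case the merged hypothesis switches the second token from "dock" to "dog", which mirrors the kind of correction the abstract attributes to image information.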

Paper Structure

This paper contains 19 sections, 22 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison between text hotwords and the vision hotwords proposed in this paper. Text hotwords are a set of custom keywords that are prone to errors, while vision hotwords are patches of an image. A darker rectangle indicates a hotword that is more relevant to the transcription.
  • Figure 2: The structure of our proposed model, VHASR. The green dashed box contains the modules of the ASR stream, while the blue dashed box contains the modules of the VH stream. The data flow in the ASR stream is indicated by green and red lines; data passes through the red lines only during the ASR model's second training pass. The VH stream's data flow is denoted by blue lines. The data flow for calculating audio-image similarity is represented by yellow lines. The purple lines illustrate the data flow when the two streams are merged.
  • Figure 3: Using vision hotword-audio similarity and image-audio similarity to learn a fine-grained visual representation.
  • Figure 4: The specific process of decoding optimization.
  • Figure 5: Three examples of how the VH stream helps rectify the ASR stream's errors.
  • ...and 1 more figure
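
The captions for Figures 1 and 3 describe vision hotwords as image patches whose relevance to the transcription varies, with similarity to the audio used as a signal. The exact formulation is not given on this page, so the sketch below only illustrates the general idea under stated assumptions: each patch and the audio are represented by plain embedding vectors, relevance is cosine similarity, and a softmax with a hypothetical `temperature` turns similarities into weights. None of these names or choices come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hotword_weights(audio_emb, patch_embs, temperature=0.1):
    """Softmax over audio-patch cosine similarities (illustrative only)."""
    sims = [cosine(audio_emb, p) for p in patch_embs]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    audio = [1.0, 0.0]                               # toy audio embedding
    patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy patch embeddings
    for i, w in enumerate(hotword_weights(audio, patches)):
        print(f"patch {i}: weight {w:.4f}")
```

Here the patch aligned with the audio embedding receives almost all of the weight, which corresponds to the "darker rectangle" in Figure 1.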