A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering

Georgios Sidiropoulos, Evangelos Kanoulas

TL;DR

Experiments show that the proposed ASR-free, end-to-end trained multimodal dense retriever is a promising alternative to the "ASR and Retriever" pipeline: it achieves better retrieval performance when ASR would have mistranscribed important words in the question or produced a transcription with a high word error rate.

Abstract

Speech-based open-domain question answering (QA over a large corpus of text passages with spoken questions) has emerged as an important task due to the increasing number of users interacting with QA systems via speech interfaces. Passage retrieval is a key task in speech-based open-domain QA. Previous work has adopted pipelines in which an automatic speech recognition (ASR) model transcribes the spoken question and the transcription is then fed to a dense text retriever. Such pipelines have several limitations. The need for an ASR model limits their applicability to low-resource languages and specialized domains with no annotated speech data. Furthermore, the ASR model propagates its errors to the retriever. In this work, we aim to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that works directly on spoken questions. Our experimental results show that, on shorter questions, our retriever is a promising alternative to the "ASR and Retriever" pipeline, achieving better retrieval performance when ASR would have mistranscribed important words in the question or produced a transcription with a high word error rate.
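
To make the retrieval step concrete, the sketch below ranks passages by the similarity between a question embedding and passage embeddings. It is an illustration only, not the authors' implementation: the dot-product scoring, the toy two-dimensional vectors, and the names `dot` and `rank_passages` are all assumptions. In a pipeline, the question embedding would come from encoding the ASR transcription; in an ASR-free retriever, it would come from encoding the speech signal directly.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity function)."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(question_vec, passage_vecs, k=2):
    """Return the indices of the k passages with the highest similarity score.

    question_vec: embedding of the question (hypothetical encoder output).
    passage_vecs: list of passage embeddings from a text encoder.
    """
    scored = [(dot(question_vec, p), idx) for idx, p in enumerate(passage_vecs)]
    # Sort by score, highest first; Python's sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings; real encoders produce vectors with hundreds of dimensions.
passages = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
question = [1.0, 0.0]
top = rank_passages(question, passages, k=2)
```

Only the question encoder differs between the two setups in this sketch; the passage side and the scoring stay the same.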

Paper Structure

This paper contains 17 sections, 6 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Overview of our multimodal dense retriever.
  • Figure 2: Retrieval performance with respect to the WER of questions on Spoken-MSMARCO (dev). Each bin with a non-zero WER has ${\sim}1300$ samples, while the bin with zero WER has ${\sim}1600$ samples. We also report the average question length, in tokens, per bin.
  • Figure 3: Retrieval results with respect to the relevant importance of the mistranscribed words on Spoken-MSMARCO (dev). For questions with multiple mistranscribed words, we use the word with the highest relevant importance to assign the question to a bin. Bins have ${\sim}1000$ samples.
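
Figure 2 groups questions by word error rate (WER). As a reminder of how that quantity is commonly computed, here is a minimal sketch: the word-level edit distance between the reference question and the ASR transcription, divided by the number of reference words. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the paper's exact text normalization is not given here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate(
    "what is the capital of france",
    "what is the capitol of france",
)
```

Because insertions count as errors, WER can exceed 1.0 when the transcription is much longer than the reference.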