GSQA: An End-to-End Model for Generative Spoken Question Answering

Min-Han Shih, Ho-Lam Chung, Yu-Chi Pai, Ming-Hao Hsu, Guan-Ting Lin, Shang-Wen Li, Hung-yi Lee

TL;DR

This paper introduces the first end-to-end Generative Spoken Question Answering (GSQA) model. It enables the system to engage in abstractive reasoning and shows the potential to generalize to a broad spectrum of questions, extending spoken question answering beyond extractive span selection.

Abstract

End-to-end models have recently made significant strides in spoken question answering (QA). However, previous research has focused primarily on extractive span selection. This extractive approach is effective when the answer appears directly in the input, but it falls short on abstractive questions, where the answer is not extracted directly but must be inferred from the given information. To bridge this gap, we introduce the first end-to-end Generative Spoken Question Answering (GSQA) model, which enables the system to engage in abstractive reasoning. The main challenge in training our GSQA model is the absence of a spoken abstractive QA dataset. We propose initializing from text models and leveraging the extractive QA dataset to transfer knowledge from the text generative model to the spoken generative model. Experimental results indicate that our model surpasses the previous extractive model by 3% on extractive QA datasets. Furthermore, although the GSQA model has been fine-tuned only on the spoken extractive QA dataset and has never seen spoken abstractive QA data, it can still closely match the performance of the cascade model. In conclusion, our GSQA model shows the potential to generalize to a broad spectrum of questions, further expanding the abstractive capabilities of spoken question answering. Our code is available at https://voidful.github.io/GSQA

Paper Structure (18 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: GSQA compared to other baselines: The Cascade Method accommodates both abstractive and extractive QA but risks error propagation. DUAL is an end-to-end textless approach, exclusive to extractive QA. GSQA is a textless, end-to-end generative method, capable of handling both extractive and abstractive QA.
  • Figure 2: Left: The process of discrete unit quantization from synthesis data. Right: Model Training Procedure: A depiction of the transition from textual QA pretraining to spoken QA fine-tuning.
  • Figure 3: We sample 8000 examples from NMSQA-dev to measure how the cascaded model is affected by different Word Error Rates (WERs) produced by different ASR systems.
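
For readers unfamiliar with the Word Error Rate mentioned in Figure 3, it is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between an ASR hypothesis and the reference transcript, divided by the number of reference words. The following is a minimal, generic sketch of that standard definition, not the evaluation script used in the paper; the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the")
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, this measure can exceed 1.0 when the hypothesis is much longer than the reference.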