Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

Paper Structure (25 sections, 10 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: A comparison demonstrating the critical role of Learning to Focus on Evidence (LFE). For the audio query "When was the Governor ended?", our complete AEG framework successfully grounds the answer in the correct evidence (Doc 3), while both AEG without LFE and baseline methods fail to identify relevant evidence, resulting in incorrect responses.
  • Figure 2: Overview of the proposed Attention-guided Evidence Grounding (AEG) method. AEG comprises two components: (1) Learning to Focus on Evidence, a supervised fine-tuning stage that calibrates the SpeechLLM’s attention toward key evidence, and (2) Grounding with Attention, an inference stage that leverages learned attention patterns to highlight and ground key evidence.
  • Figure 3: Heatmaps of attention weight evolution across layers (y-axis, 1-32) and training steps (x-axis, 0-190) in Qwen2-Audio. (a) Weights allocated to key evidence. (b) Weights assigned to irrelevant evidence. (c) The difference (diff) between key and irrelevant weights. Green boxes (layers 10-28) highlight the most effective layers.
  • Figure 4: Impact of the grounding threshold $\tau$ on evidence selection performance ($\text{F}_1$), comparing the baseline (base) and LFE (sft) models.
  • Figure 5: Impact of the grounding threshold $\tau$ on evidence selection performance, including Hit Rate, Precision, Recall, and $\text{F}_1$ score, comparing the baseline (base) and LFE (sft) models.
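
The threshold sweep described in Figures 4 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalization of per-document attention mass to shares, and the definition of Hit Rate as "at least one gold document selected" are all assumptions made for illustration, since this page does not specify them.

```python
from typing import Dict, Sequence, Set


def select_evidence(doc_scores: Sequence[float], tau: float) -> Set[int]:
    """Pick documents whose share of the total attention mass is >= tau.

    Assumption: `doc_scores` holds one non-negative attention score per
    candidate document, already aggregated over heads, layers, and tokens.
    """
    total = sum(doc_scores)
    if total <= 0:
        return set()
    return {i for i, score in enumerate(doc_scores) if score / total >= tau}


def selection_metrics(selected: Set[int], gold: Set[int]) -> Dict[str, float]:
    """Precision, recall, F1, and hit for one query.

    Assumption: a "hit" means at least one gold document was selected.
    """
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    hit = 1.0 if true_positives > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "hit": hit}


def sweep_thresholds(
    doc_scores: Sequence[float], gold: Set[int], taus: Sequence[float]
) -> Dict[float, Dict[str, float]]:
    """Evaluate evidence selection at each candidate threshold."""
    return {
        tau: selection_metrics(select_evidence(doc_scores, tau), gold)
        for tau in taus
    }


# Hypothetical scores for four candidate documents; document 2 is the gold one.
example_scores = [0.05, 0.10, 0.60, 0.25]
example_results = sweep_thresholds(example_scores, gold={2}, taus=[0.2, 0.5, 0.7])
```

In this toy case a low threshold admits an extra document and lowers precision, while a threshold above every document's share selects nothing, which is the kind of trade-off a plot over $\tau$ would expose.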