Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang; Bolin Chen; Yuejie Li; Yueying Hua; Jianhao Nie; Yueping He; Bowen Li; Chengjun Mao

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.
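As a rough illustration of the grounding idea described above, the sketch below scores each candidate passage by the attention mass it receives and keeps the passages whose normalized score reaches a threshold. The function name `ground_evidence`, the mean-pooling over tokens, the max-normalization, and the default threshold are all assumptions made for illustration; the abstract does not specify how AEG aggregates attention.

```python
from typing import Dict, List, Tuple


def ground_evidence(
    attn: List[float],
    spans: Dict[str, Tuple[int, int]],
    tau: float = 0.5,
) -> List[str]:
    """Select passages whose pooled attention reaches tau.

    attn  : hypothetical per-token attention weights over the text context,
            assumed already averaged over heads, layers, and query positions.
    spans : passage id -> (start, end) token range, end exclusive.
    tau   : hypothetical threshold on the max-normalized passage score.
    """
    scores: Dict[str, float] = {}
    for doc_id, (start, end) in spans.items():
        if not 0 <= start < end <= len(attn):
            raise ValueError(f"invalid span for {doc_id}: {(start, end)}")
        tokens = attn[start:end]
        # Mean-pool token weights inside the passage (an assumed aggregation).
        scores[doc_id] = sum(tokens) / len(tokens)

    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        # No attention mass anywhere: nothing can be grounded.
        return []
    # Keep passages whose score, relative to the best passage, reaches tau.
    return [doc_id for doc_id, s in scores.items() if s / top >= tau]


if __name__ == "__main__":
    toy_attn = [0.02, 0.03, 0.01, 0.30, 0.35, 0.25, 0.02, 0.02]
    toy_spans = {"doc1": (0, 3), "doc2": (3, 6), "doc3": (6, 8)}
    print(ground_evidence(toy_attn, toy_spans, tau=0.5))
```

With a diffuse attention distribution, many passages would score close to the maximum and survive the threshold, which is the situation the LFE fine-tuning stage is described as addressing.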

Paper Structure (25 sections, 10 equations, 5 figures, 5 tables)

Introduction
Related Work
Methodology
Task Definition
Overall Framework
Grounding with Attention
Attention Weight Extraction
Key Evidence Grounding
Learning to Focus on Evidence
Experiment
Environment Setup
Datasets and Metrics
Baselines and Base Model
Implementation Details
Main Results
...and 10 more sections

Figures (5)

Figure 1: A comparison demonstrating the critical role of Learning to Focus on Evidence (LFE). For the audio query "When was the Governor ended?", our complete AEG framework successfully grounds the answer in the correct evidence (Doc 3), while both AEG without LFE and baseline methods fail to identify relevant evidence, resulting in incorrect responses.
Figure 2: Overview of the proposed Attention-guided Evidence Grounding (AEG) method. AEG comprises two components: (1) Learning to Focus on Evidence—a supervised fine-tuning stage that calibrates the SpeechLLM’s attention toward key evidence, and (2) Grounding with Attention—an inference stage that leverages learned attention patterns to highlight and ground key evidence.
Figure 3: Heatmaps of attention weight evolution across layers (y-axis, 1-32) and training steps (x-axis, 0-190) in Qwen2-Audio. (a) Weights allocated to key evidence. (b) Weights assigned to irrelevant evidence. (c) The difference (diff) between key and irrelevant weights. Green boxes (layers 10-28) highlight the most effective layers.
Figure 4: Impact of the grounding threshold $\tau$ on evidence selection performance ($\text{F}_1$), comparing the baseline (base) and LFE (sft) models.
Figure 5: Impact of the grounding threshold $\tau$ on evidence selection performance, including Hit Rate, Precision, Recall, and $\text{F}_1$ score, comparing the baseline (base) and LFE (sft) models.
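
The threshold sweeps in Figures 4 and 5 report set-level precision, recall, and $\text{F}_1$ for the selected evidence. A minimal sketch of how such metrics could be computed for a single query is given below. The function names `evidence_prf` and `sweep_threshold` and the toy scores are hypothetical, and the paper's exact evaluation protocol (including how Hit Rate is defined) is not stated in this section, so Hit Rate is left out.

```python
from typing import Dict, Iterable, Set, Tuple


def evidence_prf(selected: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Standard set-level precision, recall, and F1 for one query."""
    true_pos = len(selected & gold)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep_threshold(
    scores: Dict[str, float],
    gold: Set[str],
    taus: Iterable[float],
) -> Dict[float, Tuple[float, float, float]]:
    """Evaluate evidence selection at each threshold in taus.

    scores : hypothetical passage id -> grounding score in [0, 1].
    gold   : ids of the passages annotated as key evidence.
    """
    results = {}
    for tau in taus:
        # A passage is selected when its score reaches the threshold.
        selected = {doc_id for doc_id, s in scores.items() if s >= tau}
        results[tau] = evidence_prf(selected, gold)
    return results


if __name__ == "__main__":
    toy_scores = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
    for tau, (p, r, f1) in sweep_threshold(toy_scores, {"doc1"}, [0.05, 0.3, 0.5]).items():
        print(f"tau={tau:.2f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```

In this toy setting, raising the threshold trades recall for precision: a low threshold admits irrelevant passages, while a high one risks dropping key evidence.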
