Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

Miriam Winkler; Verena Blaschke; Barbara Plank

Abstract

Indirectness is a common feature of daily communication, yet it is underexplored in NLP research for both low-resource and high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset that contains artificial data generated by GPT-4o-mini. Based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa), we find that IQA is a pragmatically hard task that comes with various challenges. We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that IQA performance is poor in both high-resource (English, German) and low-resource (Bavarian) languages and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.

Paper Structure (44 sections, 7 figures, 10 tables)

Introduction
Contributions
Related Work
IQA resources are good, but sparse.
Indirectness is diverse.
InQA+: Hand-Curated Test Dataset
Annotations and label definitions
Label set variations
Inter-annotator agreement
Gen-IQA: Artificial Training Dataset
Dataset variations
Bavarian Gen-IQA language quality
Labelling accuracy
Experimental Setups
Results and Analysis
...and 29 more sections

Figures (7)

Figure 1: Confusion matrices between two annotators (top left) and between the Gen-IQA labels as originally generated and as re-annotated by the main annotator, each computed on 100 sentences from the respective dataset. Cond. = Conditional Yes; Neith. = Neither Yes nor No; Lack. = Lacking Context.
Figure 2: Accuracy per genre for mBERT models, averaged over three seeds and evaluated on InQA+.
Figure 3: Accuracy per genre for mBERT models, averaged over three seeds and evaluated on InQA+ yes-no.
Figure 4: Personal disclosures of the participants in the dialect quality survey.
Figure 5: Participant origin regions of the dialect quality survey: Bavaria (per administrative region) and Austria.
...and 2 more figures
