Unsupervised multiple choices question answering via universal corpus
Qin Zhang, Hao Ge, Xiaojun Chen, Meng Fang
TL;DR
This work tackles unsupervised multiple-choice question answering (MCQA) in a setting with no annotated data by leveraging a universal corpus. It proposes a two-stage pipeline: first, generate QA pairs by extracting named entities as answers and producing cloze-style questions via an unsupervised translation model; second, create high-quality distractors using named entities (NE), a knowledge graph (KG), or a KG-NE hybrid with ConceptNet. Experiments on four benchmark MCQA datasets show that the NE-based and especially the KG-NE distractor strategies yield strong performance when models are trained on the synthetic data, with Llama 2-7B benefiting significantly from fine-tuning on this corpus. The findings highlight the critical role of distractor quality and of the distribution across question types, and they demonstrate a practical path to deploying MCQA systems in new domains without labeled data.
Abstract
Unsupervised question answering is a promising yet challenging task that alleviates the burden of building large-scale annotated data in a new domain. This motivates us to study the unsupervised multiple-choice question answering (MCQA) problem. In this paper, we propose a novel framework that generates synthetic MCQA data based solely on contexts from the universal domain, without relying on any form of manual annotation. Possible answers are extracted and used to produce related questions; we then leverage both named entities (NE) and knowledge graphs to discover plausible distractors and form complete synthetic samples. Experiments on multiple MCQA datasets demonstrate the effectiveness of our method.
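
The data-generation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes named entities have already been tagged by some external tool, it stops at the cloze question (the unsupervised translation step is omitted), and it covers only the NE-based distractor strategy, with no knowledge-graph lookup. All names here (`Entity`, `MCQASample`, `make_cloze`, `ne_distractors`, `build_sample`) are hypothetical and chosen for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    text: str
    label: str  # entity type, e.g. "PERSON" or "GPE" (assumed to come from an NER tool)


@dataclass
class MCQASample:
    context: str
    question: str
    options: List[str]
    answer_index: int


def make_cloze(sentence: str, answer: str, mask: str = "[MASK]") -> str:
    """Turn a sentence into a cloze question by masking the first answer mention."""
    return sentence.replace(answer, mask, 1)


def ne_distractors(answer: Entity, pool: List[Entity], k: int,
                   rng: random.Random) -> List[str]:
    """Pick up to k distractors that share the answer's entity type."""
    candidates = sorted(
        {e.text for e in pool if e.label == answer.label and e.text != answer.text}
    )
    return rng.sample(candidates, min(k, len(candidates)))


def build_sample(sentence: str, answer: Entity, pool: List[Entity],
                 k: int = 3, seed: int = 0) -> MCQASample:
    """Assemble one synthetic multiple-choice sample from a tagged sentence."""
    rng = random.Random(seed)
    options = ne_distractors(answer, pool, k, rng) + [answer.text]
    rng.shuffle(options)
    return MCQASample(
        context=sentence,
        question=make_cloze(sentence, answer.text),
        options=options,
        answer_index=options.index(answer.text),
    )


# Toy usage with hand-written entities (illustrative data, not from the paper).
sentence = "Marie Curie was born in Warsaw."
answer = Entity("Warsaw", "GPE")
pool = [
    Entity("Paris", "GPE"),
    Entity("Berlin", "GPE"),
    Entity("Vienna", "GPE"),
    Entity("Marie Curie", "PERSON"),
    Entity("Warsaw", "GPE"),
]
sample = build_sample(sentence, answer, pool)
print(sample.question)
print(sample.options, sample.answer_index)
```

Restricting distractors to the answer's entity type is the simplest way to keep options plausible in this sketch. A KG-based or hybrid variant would instead draw candidates from concepts related to the answer in a resource such as ConceptNet.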
