Unsupervised multiple choices question answering via universal corpus

Qin Zhang, Hao Ge, Xiaojun Chen, Meng Fang

TL;DR

This work tackles unsupervised multiple-choice question answering in a fully non-annotated setting by leveraging a universal corpus. It proposes a two-stage pipeline: first generate QA pairs by extracting named entities as answers and producing cloze-style questions via an unsupervised translation model, then create high-quality distractors using NE, KG, or a KG-NE hybrid with ConceptNet. Experiments on four benchmark MCQA datasets show that NE-based and especially KG-NE distractor strategies yield strong performance when models are trained on synthetic data, with Llama 2-7B benefiting significantly from fine-tuning on this corpus. The findings highlight the critical role of distractor quality and distribution across question types, demonstrating a practical path to deploy MCQA systems in new domains without labeled data.

Abstract

Unsupervised question answering is a promising yet challenging task, which alleviates the burden of building large-scale annotated data in a new domain. This motivates us to study the unsupervised multiple-choice question answering (MCQA) problem. In this paper, we propose a novel framework that generates synthetic MCQA data based solely on contexts from the universal domain, without relying on any form of manual annotation. Possible answers are extracted and used to produce related questions; we then leverage both named entities (NE) and knowledge graphs to discover plausible distractors and form complete synthetic samples. Experiments on multiple MCQA datasets demonstrate the effectiveness of our method.

Paper Structure

This paper contains 10 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: An overview of our method. In the first stage, we extract the answers $a$ from the context $c$, then generate their corresponding questions $q$. In the second stage, we use a hybrid method, KG-NE, to generate distractors, thus building the answer candidate set $\mathcal{C}$.
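
The two stages in Figure 1 can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the `TOY_ENTITIES` dictionary is a hypothetical stand-in for a real named-entity tagger, the cloze question is produced by simple masking rather than by the unsupervised translation model, and only the NE distractor strategy is shown (the KG and KG-NE strategies with ConceptNet are omitted). All function and class names are illustrative.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical toy entity inventory; a real system would use an NER tagger.
TOY_ENTITIES = {
    "PERSON": ["Marie Curie", "Albert Einstein", "Isaac Newton", "Niels Bohr"],
    "GPE": ["Paris", "Warsaw", "Berlin", "London"],
}


@dataclass
class MCQASample:
    context: str           # c
    question: str          # q (cloze-style here)
    answer: str            # a
    candidates: List[str]  # candidate set C = distractors + answer


def extract_answers(context: str) -> List[Tuple[str, str]]:
    """Stage 1a: return (entity, type) pairs found in the context by string match."""
    return [
        (name, ent_type)
        for ent_type, names in TOY_ENTITIES.items()
        for name in names
        if name in context
    ]


def make_cloze(context: str, answer: str, mask: str = "[MASK]") -> str:
    """Stage 1b: mask the answer in the first sentence that contains it."""
    for sentence in context.split(". "):
        if answer in sentence:
            return sentence.replace(answer, mask).rstrip(".") + "."
    raise ValueError("answer does not occur in context")


def ne_distractors(answer: str, ent_type: str, k: int, rng: random.Random) -> List[str]:
    """Stage 2 (NE strategy only): sample same-type entities other than the answer."""
    pool = [name for name in TOY_ENTITIES[ent_type] if name != answer]
    return rng.sample(pool, min(k, len(pool)))


def build_samples(context: str, k: int = 3, seed: int = 0) -> List[MCQASample]:
    """Run both stages and return one synthetic MCQA sample per extracted answer."""
    rng = random.Random(seed)
    samples = []
    for answer, ent_type in extract_answers(context):
        question = make_cloze(context, answer)
        candidates = ne_distractors(answer, ent_type, k, rng) + [answer]
        rng.shuffle(candidates)  # avoid a fixed answer position
        samples.append(MCQASample(context, question, answer, candidates))
    return samples


if __name__ == "__main__":
    text = "Marie Curie was born in Warsaw. She later moved to Paris."
    for sample in build_samples(text):
        print(sample.question, sample.candidates, "->", sample.answer)
```

Sampling distractors from entities of the same type as the answer keeps the candidates superficially plausible, which is the intuition behind the NE strategy; the toy pool here makes no attempt to exclude entities that also appear in the context.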