DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, Karthik Sankaranarayanan

TL;DR

DuoRC presents a large-scale RC dataset built from parallel movie plots (Wikipedia vs IMDb) where questions are generated from one version and answers drawn from the other, creating low lexical overlap and necessitating external knowledge, coreference and multi-sentence inference. The authors establish baselines with SpanModel (BiDAF) and GenModel (span prediction plus abstractive generation) and demonstrate that ParaphraseRC is substantially harder than SelfRC, with preprocessing and data augmentation offering limited gains. The work shows that existing SQuAD-style models perform poorly on this dataset, highlighting new research directions for narrative reasoning, unanswerability detection, and cross-version paraphrase understanding. As a complementary benchmark, DuoRC aims to drive progress toward more robust, knowledge-enabled QA systems capable of complex language understanding.

Abstract

We propose DuoRC, a novel dataset for Reading Comprehension (RC) that motivates several new challenges for neural approaches in language understanding beyond those offered by existing RC datasets. DuoRC contains 186,089 unique question-answer pairs created from a collection of 7,680 pairs of movie plots, where each pair reflects two versions of the same movie, one from Wikipedia and the other from IMDb, written by two different authors. We asked crowdsourced workers to create questions from one version of the plot and a different set of workers to extract or synthesize answers from the other version. This unique characteristic of DuoRC, where questions and answers are created from different versions of a document narrating the same underlying story, ensures by design that there is very little lexical overlap between the questions created from one version and the segments containing the answer in the other version. Further, since the two versions differ in level of plot detail, narration style, vocabulary, etc., answering questions from the second version requires deeper language understanding and the incorporation of external background knowledge. Additionally, the narrative style of passages arising from movie plots (as opposed to the typical descriptive passages in existing datasets) creates the need to perform complex reasoning over events across multiple sentences. Indeed, we observe that state-of-the-art neural RC models, which have achieved near-human performance on the SQuAD dataset, exhibit very poor performance on DuoRC even when coupled with traditional NLP techniques to address its challenges (F1 score of 37.42% on DuoRC vs. 86% on SQuAD). This opens up several interesting research avenues wherein DuoRC could complement other RC datasets to explore novel neural approaches for studying language understanding.
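
The low lexical overlap described in the abstract can be illustrated with a small sketch. The snippet below is not the authors' measurement procedure. It assumes a simple definition of overlap (the fraction of distinct question tokens that also appear in the passage), and the function names `tokens` and `question_overlap`, as well as the example sentences, are invented for illustration and are not taken from DuoRC.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def question_overlap(question, passage):
    """Fraction of distinct question tokens that also occur in the passage.

    This is an assumed, simplified overlap measure, not the paper's metric.
    """
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(passage)) / len(q)


# Invented sentences for illustration only; they are NOT from the dataset.
question = "Who rescues the child from the burning house?"
same_version = "A firefighter rescues the child from the burning house."
other_version = "The blaze traps a young girl until a fireman carries her out."

print(question_overlap(question, same_version))   # question vs. its own version
print(question_overlap(question, other_version))  # question vs. the paraphrase
```

In this toy case the question shares most of its words with the version it was written from and almost none with the paraphrased version, which is the kind of gap that prevents a model from answering by surface matching alone.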

Paper Structure

This paper contains 21 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Example QA pairs obtained from the original movie plot and the paraphrased plot. The relevant spans needed for answering the corresponding question are highlighted in blue and red with the respective question numbers. Note that the span highlighting shown here is for illustrative purposes only and is not available in the dataset.
  • Figure 2: Analysis of the Question Types
  • Figure 3: Manual Analysis of 100 Questions and their corresponding answers from the SelfRC and ParaphraseRC Dataset to understand the various reasons behind these two answers being different or the latter being non-answerable
  • Figure 4: Model architecture
  • Figure 5: Performance analysis on SelfRC and ParaphraseRC across different plot lengths and question types