Machine Reading Comprehension using Case-based Reasoning

Dung Thai; Dhruv Agarwal; Mudit Chaudhary; Wenlong Zhao; Rajarshi Das; Manzil Zaheer; Jay-Yoon Lee; Hannaneh Hajishirzi; Andrew McCallum

TL;DR

CBR-MRC addresses the interpretability gap in extractive machine reading comprehension with a semi-parametric, case-based reasoning framework. It retrieves cases with similar questions, reuses their contextualized answer representations, and scores candidate spans in the target context by their similarity to the retrieved answers, so each prediction can be attributed to specific evidence cases. The approach outperforms baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively, identifies supporting evidence well, and stays robust under lexical diversity and in few-shot domain adaptation. The results suggest that grounding QA in retrieved evidence is a practical route to more transparent and adaptable QA systems.

Abstract

We present an accurate and interpretable method for answer extraction in machine reading comprehension that is reminiscent of case-based reasoning (CBR) from classical AI. Our method (CBR-MRC) builds upon the hypothesis that contextualized answers to similar questions share semantic similarities with each other. Given a test question, CBR-MRC first retrieves a set of similar cases from a nonparametric memory and then predicts an answer by selecting the span in the test context that is most similar to the contextualized representations of answers in the retrieved cases. The semi-parametric nature of our approach allows it to attribute a prediction to the specific set of evidence cases, making it a desirable choice for building reliable and debuggable QA systems. We show that CBR-MRC provides high accuracy comparable to that of large reader models and outperforms baselines by 11.5 and 8.4 EM on NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability of CBR-MRC to identify not just the correct answer tokens but also the span with the most relevant supporting evidence. Lastly, we observe that contexts for certain question types show higher lexical diversity than others and find that CBR-MRC is robust to these variations, while the performance of fully-parametric methods drops.
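The retrieve-then-score procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `retrieve_cases` and `predict_span`, the toy vectors, and the choice to aggregate similarities by summation are all assumptions made for illustration. In the paper, the embeddings would come from a trained encoder rather than being written by hand.

```python
import numpy as np

def retrieve_cases(query_vec, case_question_vecs, k):
    """Return indices of the k casebase questions most similar to the query.

    Similarity is the inner product between question embeddings (assumed).
    """
    scores = case_question_vecs @ query_vec
    return np.argsort(-scores)[:k]

def predict_span(candidate_span_vecs, retrieved_answer_vecs):
    """Pick the candidate span most similar to the retrieved answer spans.

    Each candidate is scored by the sum of its inner products with the
    answer embeddings of the retrieved cases (summation is an assumption).
    """
    sim = candidate_span_vecs @ retrieved_answer_vecs.T  # (num_candidates, k)
    aggregate = sim.sum(axis=1)
    return int(np.argmax(aggregate)), aggregate

# Toy casebase: three cached question embeddings and their answer embeddings.
case_question_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
case_answer_vecs = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]])

# Toy test question and candidate spans from its context.
query_vec = np.array([1.0, 0.0])
candidate_span_vecs = np.array([[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])

top = retrieve_cases(query_vec, case_question_vecs, k=2)
best, scores = predict_span(candidate_span_vecs, case_answer_vecs[top])
```

With these toy vectors, cases 0 and 1 are retrieved and candidate span 1 is selected. Because the prediction is a function of the retrieved rows only, the indices in `top` double as the evidence to which the prediction can be attributed.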

Paper Structure (32 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Work
Machine Reading Comprehension.
Case-based Reasoning.
In-Context Learning (ICL).
Method
Case Retrieval
Case retrieval during training.
Case Reuse
Training
Experiments
Datasets
Experiment Setup
Baselines.
Evaluation Metrics.
...and 17 more sections

Figures (4)

Figure 1: Overview of CBR-MRC. Top: given a test question, CBR-MRC first retrieves similar cases from a memory of known (question, context, answer) triples. Bottom: CBR-MRC scores each candidate answer span by comparing its contextualized representation with the answer spans of the retrieved cases to output a score. The candidate with the highest aggregate score with the answer spans from the cases is output as the prediction ("Charles Babbage").
Figure 2: Inference with CBR-MRC. For a given target query, (1) other similar queries (and their contexts) are retrieved; (2) candidate answer spans are extracted from the target context; (3) the candidate spans in the target context are ranked w.r.t. answer spans of the retrieved queries by comparing their contextualized embeddings. Finally, the span with the highest inner-product similarity is selected as the prediction. Note that the casebase questions and their corresponding answer span embeddings are pre-computed and cached.
Figure 3: Robustness to lexical diversity in passages. Lexical diversity of latent relation clusters is the number of unique tokens in passages seen at training for that latent relation cluster (or question type). We plot F1 performance (averaged over 6 clusterings varying cut thresholds) on 8 buckets of increasing lexical diversity for the latent relations seen at training. CBR-MRC shows a drop in performance only up to 4.10 points, while BLANC and MADE show drops of up to 15.38 and 11.80 points, respectively.
Figure 4: F1 performance vs. lexical diversity in passage contexts on clusterings with decreasing levels of cluster-tightness. We cluster questions by latent relations in the training set with hierarchical agglomerative clustering (HAC) at 6 cut thresholds and assign each test question to one of these train clusters. We then bucket the clusters by lexical diversity scores and compute F1 performance of each bucket for each of the 6 clusterings shown.
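The lexical diversity measure used in Figures 3 and 4 (the number of unique tokens in the training passages of a latent relation cluster) can be sketched as follows. The function name `lexical_diversity`, the lowercased whitespace tokenization, and the toy passages are assumptions for illustration; the paper's tokenizer and clustering are not specified in the captions.

```python
from collections import defaultdict

def lexical_diversity(passages, cluster_ids):
    """Count unique tokens across the passages assigned to each cluster.

    Tokens are lowercased whitespace-separated strings (an assumption).
    """
    vocab = defaultdict(set)
    for passage, cluster_id in zip(passages, cluster_ids):
        vocab[cluster_id].update(passage.lower().split())
    return {cluster_id: len(tokens) for cluster_id, tokens in vocab.items()}

# Toy training passages and the cluster each one belongs to.
passages = ["a b c", "A b", "x y z w v"]
cluster_ids = [0, 0, 1]
diversity = lexical_diversity(passages, cluster_ids)
```

Here cluster 0 has 3 unique tokens and cluster 1 has 5. Clusters would then be sorted by this score and grouped into buckets before computing F1 per bucket.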
