Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)

Tim Hartill

TL;DR

Novel methods are proposed to show that the Reasoning Model is capable of answering contextualised questions without memorisation, and a comprehensive set of baseline results on unseen evaluation datasets is established.

Abstract

Pretrained large Language Models (LLMs) are able to answer questions that are unlikely to have been encountered during training. However, a diversity of potential applications exists in the broad domain of reasoning systems, and considerations such as latency, cost, available compute resource and internet connectivity are relevant in determining an appropriate approach. We consider the setting where some local compute capacity is available at inference time but internet connectivity is not. Similar to a general-purpose LLM, we assume that our much smaller Reasoning Models may be asked arbitrary questions from unknown distributions, so we focus on evaluation in an unseen setting. We train our models to answer diverse questions by instilling an ability to reason over a retrieved context. We acquire context from two knowledge sources: a Wikipedia corpus queried using a multi-hop dense retrieval system with novel extensions, and rationales generated from a larger Language Model optimised to run in a lower resource environment. Our main contributions are as follows. We propose novel methods to show that our model is capable of answering contextualised questions without memorisation. We establish a comprehensive set of baseline results on unseen evaluation datasets. We show that the addition of novel retrieval-augmented training datasets (RATD) to the training regime of the Reasoning Model significantly improves results. We demonstrate further significant improvement through the application of methods for combining knowledge from two sources. The first method (RR) involves training a novel Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We use the scores to derive combined contexts. We also show that utilising the RATD datasets enables our model to become proficient at utilising combined noisy contexts.

Paper Structure

This paper contains 97 sections, 6 equations, 4 figures, 34 tables.

Figures (4)

  • Figure 1: Visualisation of key aspects of our methods. We consider two models, one trained on a set of question-answering datasets (UQA) and the other trained on UQA plus two additional datasets collectively referred to as TDND (UQA+TDND). TDND samples are constructed so as to improve performance on some of our evaluation datasets and to be irrelevant for others. Our objective is to understand whether any improvement is attributable to memorisation or to TDND samples imparting an improved ability to generalise. We select evaluation samples that are very unlikely to have become memorisable from our training datasets based on a semantic similarity score (Section \ref{sec:memo:sim_method}), and compare performance between the two models. Our method enables evaluating performance for each model on the same subset of unmemorisable samples, and it does not require access to the pretraining corpus.
  • Figure 2: Major system components: The Iterator (green boxes) and Reasoning Model (blue box). An initial query for hop $t=0$ is input into the Retriever. The Reranker scores each of the retrieved $k$ paragraphs and constituent sentences. Top-$x$ sentences (Evidence Set$_{\leq t}$) are selected from top-ranked sentences from the Reranker and from the prior hop Evidence Set$_{<t}$. The query + Evidence Set$_{\leq t}$ are input into the Evidence Set Scorer which computes an overall Evidence Set Relevance Score $e$ and individual sentence relevance scores. Paragraphs associated with the top five sentences of Evidence Set$_{\leq t}$ are appended to the query and the process repeats $t_{\max}$ times. Finally, paragraph fragments recovered from the Evidence Set for hop $t=\arg\max(e)$ are concatenated with the original query and input into the Reasoning Model for answer generation.
  • Figure 3: Overview of our approach. Given an unseen question Q: [1] we acquire explanatory contexts, C1 and C2, from two knowledge sources. [2] We score the acquired contexts for relevance and truthfulness using a Rationale Ranking (RR) model that we train on diverse relevant/irrelevant samples that make both truthful and false assertions. [3] We evaluate and select methods for combining or filtering C1 and C2. [4] We evaluate the performance of different contexts (Cn) on a set of Reasoning Models that are trained on different mixtures of training datasets, including a mixture containing RATD datasets, and a mixture without these. In the diagram, red denotes false information and green highlights relevant and truthful evidence.
  • Figure 4: Examples of combining contexts. For a question Q, we acquire two contexts, C1 and C2. The resulting combined context for our combination methods with example thresholds and RR model scores is then shown in blue boxes where "+" denotes the concatenation of C1 and C2. The Naïve Concatenation is always C1 + C2. Formatted examples of resulting contexts are shown at the bottom of the figure with titles shown in bold for readability. The phrase "Further Explanation" is added to the rationale in a concatenated context to mimic a document title.
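
The retrieval loop described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the thesis implementation: the function `iterate`, its parameters (`k`, `top_x`, `t_max`), and the toy word-overlap scorers are hypothetical stand-ins for the Retriever, Reranker and Evidence Set Scorer. As a simplification, the sketch appends evidence sentences to the next-hop query rather than their parent paragraphs, and returns sentences rather than paragraph fragments.

```python
from typing import Callable, List, Tuple


def iterate(
    query: str,
    retrieve: Callable[[str, int], List[str]],      # (query, k) -> paragraphs
    score_sentence: Callable[[str, str], float],    # (query, sentence) -> relevance
    score_set: Callable[[str, List[str]], float],   # (query, evidence set) -> e
    k: int = 3,
    top_x: int = 2,
    t_max: int = 2,
) -> Tuple[str, float]:
    """Run t_max retrieval hops and return the context from the best-scoring hop."""
    evidence: List[str] = []
    best_e, best_evidence = float("-inf"), []
    current_query = query
    for _ in range(t_max):
        paragraphs = retrieve(current_query, k)
        # Naive sentence split on full stops (assumption for illustration).
        candidates = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]
        # Pool prior-hop evidence with new candidates, dropping duplicates.
        pool = list(dict.fromkeys(evidence + candidates))
        pool.sort(key=lambda s: score_sentence(query, s), reverse=True)
        evidence = pool[:top_x]
        # Overall relevance score e for this hop's evidence set.
        e = score_set(query, evidence)
        if e > best_e:
            best_e, best_evidence = e, list(evidence)
        # Simplification: append evidence sentences to form the next-hop query.
        current_query = query + " " + " ".join(evidence)
    # Context for the Reasoning Model: original query plus best-hop evidence.
    return query + " " + " ".join(best_evidence), best_e


# Toy stand-ins (hypothetical) so the sketch runs end to end.
CORPUS = [
    "Paris is the capital of France. It lies on the Seine",
    "Berlin is the capital of Germany",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    return CORPUS[:k]


def toy_score_sentence(query: str, sentence: str) -> float:
    return float(len(set(query.lower().split()) & set(sentence.lower().split())))


def toy_score_set(query: str, evidence: List[str]) -> float:
    total = sum(toy_score_sentence(query, s) for s in evidence)
    return total / max(len(evidence), 1)


context, e = iterate("capital of France", toy_retrieve, toy_score_sentence, toy_score_set)
print(context, e)
```

Tracking the best-scoring hop rather than always using the final hop mirrors the caption's selection of the hop $t=\arg\max(e)$; a later hop can add noise, so the last Evidence Set is not necessarily the most relevant one.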