Retrieval-Enhanced Machine Learning: Synthesis and Opportunities

To Eun Kim; Alireza Salemi; Andrew Drozdov; Fernando Diaz; Hamed Zamani

TL;DR

This tutorial addresses the disconnect in retrieval-enhanced machine learning (REML) research by introducing core REML concepts and synthesizing the literature from various domains in machine learning (ML), including but not limited to NLP.

Abstract

In language modeling, models augmented with retrieval components have emerged as a promising way to address several challenges in natural language processing (NLP), including knowledge grounding, interpretability, and scalability. Although the primary focus has been on NLP, we posit that the retrieval-enhancement paradigm can be extended to a broader spectrum of machine learning (ML), such as computer vision, time series prediction, and computational biology. This work therefore introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature from various ML domains using consistent notation, which is missing from the current literature. We also find that, although a number of studies employ retrieval components to augment their models, there is a lack of integration with foundational Information Retrieval (IR) research. We bridge this gap between seminal IR research and contemporary REML studies by investigating each component of the REML framework. Ultimately, the goal of this work is to equip researchers across disciplines with a comprehensive, formally structured framework for retrieval-enhanced models, thereby fostering future interdisciplinary research.

Paper Structure

This paper contains 99 sections, 19 equations, 2 figures, and 5 tables.

Introduction
Background
Motivation
Applications of REML
Main Contributions of This Work
Retrieval-Enhanced Machine Learning
Querying
Deciding Where to Query
Corpus Selection
Retriever Selection
Reformulating the Input
Compression
Expansion
Conversion
Decomposing the Input
...and 84 more sections

Figures (2)

Figure 1: Retrieval-enhanced machine learning models should implement three necessary requirements (querying, retrieval, and response utilization) and may implement two optional properties (storing information and providing feedback to the information access model). Combining the two optional properties yields the four categories of REML models shown in the figure. Figure taken from zamani:reml.
Figure 2: A generic framework for REML (zamani:reml). The multiplicative nature of the information access process implies that access to information can be distributed and/or performed iteratively. Note that the components do not have to be completely separate; e.g., the Query Generation or Response Processing module can be handled within the Predictive Model. In the abstract, however, we treat each of them as a component of the information access process that can be described separately.
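
The three necessary requirements named in the captions above (querying, retrieval, and response utilization) can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the framework's formal definition: `ToyRetriever`, `generate_query`, and `predict` are hypothetical names, ranking is plain term overlap, and response utilization is simple concatenation. The optional storing property appears as `store`; the optional feedback property is omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyRetriever:
    """Hypothetical information access model: ranks stored texts by term overlap."""
    corpus: list = field(default_factory=list)

    def store(self, text: str) -> None:
        # Optional property: the predictive model may store information.
        self.corpus.append(text)

    def retrieve(self, query: str, k: int = 1) -> list:
        query_terms = Counter(query.lower().split())

        def overlap(doc: str) -> int:
            # Number of shared term occurrences between query and document.
            return sum((query_terms & Counter(doc.lower().split())).values())

        return sorted(self.corpus, key=overlap, reverse=True)[:k]


def generate_query(x: str) -> str:
    # Querying: here the query is just the raw input (an assumption).
    return x


def predict(x: str, retriever: ToyRetriever, k: int = 1) -> str:
    query = generate_query(x)               # querying
    results = retriever.retrieve(query, k)  # retrieval
    # Response utilization: prepend the retrieved items to the input.
    return " | ".join(results + [x])


retriever = ToyRetriever()
retriever.store("paris is the capital of france")
retriever.store("the mitochondria is the powerhouse of the cell")
print(predict("capital of france", retriever))
```

In a real system each of these steps could live inside the predictive model itself, as the Figure 2 caption notes; they are separated here only to make the three requirements visible.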
