Thrust: Adaptively Propels Large Language Models with External Knowledge

Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Jianshu Chen

TL;DR

This work tackles the cost and noise of retrieving external knowledge for large language models by introducing Instance-level Adaptive Propulsion of External Knowledge (IAPEK). It hinges on Thrust, an instance-level knowledgeability score computed from compact hidden representations and cluster centroids, which decides whether external retrieval is necessary via a threshold on the score $S(q)$ of a query $q$. Across seven multiple-choice (MC) tasks and five open-domain QA tasks, Thrust correlates with knowledge needs and enables cost-efficient augmentation: under budgeted retrieval, it is more cost-efficient than naive use of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement, and on a subset of tasks it surpasses using external knowledge for every query. The results indicate that selective, threshold-driven retrieval can reduce noise and latency without sacrificing performance, and often while improving it, offering practical guidance for deploying knowledge-enhanced LMs in resource-constrained settings.

Abstract

Although large-scale pre-trained language models (PTLMs) are shown to encode rich knowledge in their model parameters, the inherent knowledge in PTLMs can be opaque or static, making external knowledge necessary. However, existing information retrieval techniques can be costly and may even introduce noisy and sometimes misleading knowledge. To address these challenges, we propose the instance-level adaptive propulsion of external knowledge (IAPEK), where we conduct retrieval only when necessary. To achieve this goal, we propose measuring whether a PTLM contains enough knowledge to solve an instance with a novel metric, Thrust, which leverages the representation distribution of a small number of seen instances. Extensive experiments demonstrate that Thrust is a good measurement of PTLMs' instance-level knowledgeability. Moreover, using the Thrust score as the retrieval indicator, we achieve significantly higher cost-efficiency than the naive usage of external knowledge on 88% of the evaluated tasks, with a 26% average performance improvement. Such findings shed light on the real-world practice of knowledge-enhanced LMs with a limited knowledge-seeking budget due to computation latency or costs.
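The "retrieve only when necessary" decision described above can be sketched as a small gate in front of the retriever. This is a minimal illustration, not the authors' implementation: the function names (`needs_retrieval`, `select_under_budget`, `build_prompt`), the way retrieved text is prepended to the query, and the fractional-budget rule are all assumptions made for the example. The only behavior taken from the text is that a low score signals insufficient internal knowledge.

```python
from typing import Callable, List, Sequence


def needs_retrieval(score: float, threshold: float) -> bool:
    """Assumed rule: retrieve when the knowledgeability score is below the threshold."""
    return score < threshold


def select_under_budget(scores: Sequence[float], budget: float) -> List[int]:
    """Pick which queries receive retrieval when only a fraction `budget` can be served.

    Hypothetical policy: the lowest-scoring queries are served first, since a
    low score suggests internal knowledge is insufficient.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be in [0, 1]")
    n_retrieve = int(len(scores) * budget)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_retrieve])


def build_prompt(
    query: str,
    score: float,
    threshold: float,
    retrieve: Callable[[str], str],
) -> str:
    """Prepend retrieved text only for queries that fall below the threshold."""
    if needs_retrieval(score, threshold):
        return f"{retrieve(query)}\n{query}"
    return query


if __name__ == "__main__":
    # Stand-in retriever; a real system would call a dense retriever here.
    def fake_retriever(q: str) -> str:
        return "retrieved passage"

    print(build_prompt("Who wrote Hamlet?", score=0.9, threshold=0.5, retrieve=fake_retriever))
    print(select_under_budget([0.9, 0.1, 0.5, 0.3], budget=0.5))
```

Keeping the score function, the retriever, and the gating rule separate means the same gate can be reused with any scoring method, which is convenient when comparing a threshold policy against a fixed retrieval budget.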

Paper Structure (29 sections, 1 equation, 9 figures, 6 tables)

Figures (9)

  • Figure 1: The predictions by OPT-175B without/with external knowledge retrieved via DPR (Karpukhin et al., 2020) from Wikipedia paragraphs. Although the top retrieved paragraphs are relevant, the internal knowledge is already sufficient, so the external knowledge can be either misleading (potentially due to the effect of misprime; Kassner and Schütze, 2020) or less useful.
  • Figure 2: The pipeline of retrieval-augmented models with IAPEK. Unlike previous work (e.g., RAG; Lewis et al., 2020), where models directly seek help from the retriever module, the IAPEK module provides a confidence score $S(q)$ (e.g., Thrust) on how well the PTLM can answer the question with internal knowledge and decides whether external retrieval is necessary.
  • Figure 3: The intuition behind the proposed Thrust score, plotted in the hidden representation space of the PTLM. Triangles represent incoming query instances; ticks and crosses represent the instances used for constructing Thrust scores. In the controversial and no-knowledge cases, the internal knowledge is insufficient to answer the query successfully, and external knowledge is needed to assist the PTLM. In contrast, if the model finds the query close to one of the clusters, internal knowledge should be sufficient to solve the problem, so external knowledge is unnecessary.
  • Figure 4: Performance of various models on MC classification tasks (accuracy) and open-domain QA tasks (QA-F1), denoted by (cls) and (qa), respectively. The x-axis represents the model names, which are shared across sub-figures. "Use knowledge: yes" and "Use knowledge: no" denote using full knowledge for all queries or for none. UnifiedQA denotes T5 models of different sizes fine-tuned on the UnifiedQA dataset.
  • Figure 5: Distribution of Thrust scores for various tasks by using UnifiedQA-3b to create the hidden representations. Kernel Density Estimation is used to smooth the distributions. Low scores imply that the instances are less likely to be solved with internal knowledge and vice versa. Thrust scores predict that most query instances from open-domain QA tasks require external knowledge while others need less. The results are consistent with the original design purposes of these tasks.
  • ...and 4 more figures
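The intuition in the Figure 3 caption (a query close to one cluster scores high, while a query far from every cluster or caught between clusters scores low) can be illustrated with a toy score. This is only a sketch under stated assumptions: the paper's single equation is not reproduced in this section, so the inverse-square distance weighting, the cluster-size factor, and the name `toy_knowledgeability` are illustrative choices rather than the authors' definition. Centroids are assumed to be precomputed from hidden representations of a few seen instances.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Cluster = Tuple[Vector, int]  # (centroid, number of member instances)


def toy_knowledgeability(query: Vector, clusters: List[Cluster], eps: float = 1e-12) -> float:
    """Norm of the averaged, size-weighted 'pull' of each cluster on the query.

    Each cluster pulls the query toward its centroid with strength
    size / distance^2 (an assumed weighting). Pulls from opposite sides
    cancel, and distant clusters pull weakly.
    """
    if not clusters:
        raise ValueError("at least one cluster is required")
    total = [0.0] * len(query)
    for centroid, size in clusters:
        diff = [c - q for c, q in zip(centroid, query)]
        dist = math.sqrt(sum(d * d for d in diff)) + eps
        # size / dist^2 times the unit direction diff / dist
        weight = size / dist ** 3
        for i, d in enumerate(diff):
            total[i] += weight * d
    return math.sqrt(sum(t * t for t in total)) / len(clusters)


if __name__ == "__main__":
    # Two hypothetical clusters of equal size in a 2-D representation space.
    clusters = [((-1.0, 0.0), 10), ((1.0, 0.0), 10)]
    near = toy_knowledgeability((0.9, 0.0), clusters)      # close to one cluster
    between = toy_knowledgeability((0.0, 0.0), clusters)   # pulls cancel out
    far = toy_knowledgeability((0.0, 100.0), clusters)     # far from both clusters
    print(near > far > between)
```

In this toy setup the "controversial" case comes out lowest because equal and opposite pulls cancel, and the "no knowledge" case is low because every pull is weak. Either would fall below a suitably chosen threshold and trigger retrieval.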