The Web Can Be Your Oyster for Improving Large Language Models

Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jingyuan Wang, Jian-Yun Nie, Ji-Rong Wen

TL;DR

The paper tackles the stagnation of static world knowledge in large language models by proposing UniWeb, a unified web-augmented LLM that retrieves up-to-date information from the web via a confidence-driven policy and integrates it through continual knowledge learning (CKL). It unifies 16 knowledge-intensive tasks into a text-to-text framework and trains the model in a massively multi-task manner, using an adaptive retrieval gate and a CKL objective to align retrieved content with encoded knowledge. Empirical results show UniWeb achieves best or near-best performance across the 16 tasks, with clear advantages over Wikipedia- or CCNet-based retrieval and single-task web systems, and strong case studies on real-time QA. This approach demonstrates the practical potential of live web content to extend LLM knowledge beyond static pretraining, enabling more accurate, up-to-date, and broadly capable knowledge-intensive NLP systems.

Abstract

Large language models (LLMs) encode a large amount of world knowledge. However, because this knowledge is frozen at training time, the models become static and are limited by the training data available at that time. To further improve the capacity of LLMs for knowledge-intensive tasks, we consider augmenting LLMs with the large-scale web using a search engine. Unlike previous augmentation sources (e.g., a Wikipedia data dump), the web provides broader, more comprehensive, and constantly updated information. In this paper, we present a web-augmented LLM, UniWeb, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format. Instead of simply using the retrieved contents from the web, our approach makes two major improvements. First, we propose an adaptive search-engine-assisted learning method that self-evaluates the confidence level of the LLM's predictions and adaptively determines when to refer to the web for more data, thereby avoiding useless or noisy augmentation from the web. Second, we design a pretraining task, continual knowledge learning, based on salient span prediction, to reduce the discrepancy between the encoded and retrieved knowledge. Experiments on a wide range of knowledge-intensive tasks show that our model significantly outperforms previous retrieval-augmented methods.
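
The confidence-driven decision described in the abstract can be illustrated with a small sketch. It assumes that confidence is measured by the entropy of the model's predictive distribution, which the caption of Figure 2(a) suggests but this summary does not spell out. The function names `entropy` and `should_search`, the example distributions, and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_search(probs: Sequence[float], threshold: float) -> bool:
    """Hypothetical gate: consult the web only when the prediction is uncertain.

    High entropy means the probability mass is spread over many candidates,
    so the model is treated as unconfident and retrieval is triggered.
    """
    return entropy(probs) > threshold


# Illustrative distributions over four candidate answers (made-up numbers).
confident = [0.97, 0.01, 0.01, 0.01]  # peaked: answer from encoded knowledge
uncertain = [0.25, 0.25, 0.25, 0.25]  # flat: worth querying the search engine

THRESHOLD = 0.5  # assumed value; in practice it would be tuned on held-out data

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy(dist), 3), should_search(dist, THRESHOLD))
```

A gate of this kind skips retrieval when the model is already confident, which is how the abstract's goal of avoiding useless or noisy augmentation could be met; the measure and threshold actually used by UniWeb may differ.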

Paper Structure (20 sections, 6 equations, 3 figures, 9 tables)


Figures (3)

  • Figure 1: Overview of our proposed web-augmented large language model UniWeb.
  • Figure 2: (a) Entropy of samples in HotpotQA; (b) Accuracy w.r.t. different top-$K$ documents.
  • Figure 3: (a) Probability of True for prompts in HotpotQA; (b) Loss of samples in HotpotQA.