Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models

Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar

TL;DR

Ski is a novel synthetic knowledge ingestion method that uses fine-grained synthesis, interleaved generation, and assemble augmentation to construct high-quality data representations from raw knowledge sources, which are then used to inject and refine knowledge in language models.

Abstract

Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we propose a novel synthetic knowledge ingestion method called Ski, which leverages fine-grained synthesis, interleaved generation, and assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate Ski and its variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that Ski significantly outperforms baseline methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy of LLM outputs by refining knowledge representation and injection capabilities.

Paper Structure

This paper contains 41 sections, 10 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Illustrative example. Base models are often unable to handle certain questions due to limited knowledge. Even when raw knowledge is provided, the answer may still be incorrect if the LLM has not digested that knowledge well. Our proposed Synthetic Knowledge Ingestion method, Ski, incorporates three key innovations that transform raw knowledge into refined data representations that LLMs can effectively digest. Injection pipelines such as RAG, SFT, and CPT then inject this knowledge into the LLM so that it produces correct answers.
  • Figure 2: Overview of the proposed method, Synthetic Knowledge Ingestion (Ski), which comprises three essential components. First, fine-grained synthesis generates hypothetical questions based on an $n$-gram, detail-oriented principle. Second, interleaved generation produces questions and answers simultaneously for a given knowledge piece, keeping the two aligned. Last, assemble augmentation combines question, answer, and context pairs, along with $n$-gram synthesis, to balance repetition with diversity.
  • Figure 3: Effect of Ski method on the RAG performance using different retrievers and generators.
  • Figure 4: SFT performance under different $n$-gram settings using the Llama2-7B base model across three datasets. Note that the F1-score often decreases as $n$ increases, but the drop is not large.
  • Figure 5: Relative performance (F1-score) gain for RAG, SFT, and CPT, averaged across all datasets.
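
The $n$-gram principle mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that an $n$-gram here means a window of $n$ consecutive sentences from a knowledge passage, which is one reading of the caption and not something this page confirms. The function names `split_sentences`, `sentence_ngrams`, and `assemble_contexts` are hypothetical, the sentence splitter is deliberately naive, and the LLM step that would generate questions and answers from each window is left out.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_ngrams(text: str, n: int) -> List[str]:
    """Return every window of n consecutive sentences, joined into one string.

    Assumption: 'n-gram' is taken to mean n adjacent sentences.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + n]) for i in range(len(sentences) - n + 1)]


def assemble_contexts(text: str, ns: List[int]) -> Dict[int, List[str]]:
    """Collect the windows for several values of n (hypothetical helper).

    Each window would be the context handed to a question/answer generator;
    that generator is not part of this sketch.
    """
    return {n: sentence_ngrams(text, n) for n in ns}


if __name__ == "__main__":
    passage = "Ski refines raw knowledge. It works with RAG. It also works with SFT."
    for n, windows in assemble_contexts(passage, [1, 2]).items():
        print(n, windows)
```

Smaller values of $n$ give more, shorter contexts that focus on individual details, while larger values give fewer, broader contexts; combining several values is one way to get repeated coverage of the same facts in varied forms.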