Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model

Xia Hou, Qifeng Li, Jian Yang, Tongliang Li, Linzheng Chai, Xianjie Wu, Hangyuan Ji, Zhoujun Li, Jixuan Nie, Jingbo Dun, Wenfeng Song

TL;DR

R2S presents a principled framework to inject raw-document knowledge into large language models via instruction tuning. It introduces Chain of Dialogue (CoD) to steer LLMs in generating knowledge-rich, multi-turn dialogues from raw text, supported by the k-Bench knowledge-intensive benchmark and the gInstruct synthetic dataset. A low-cost open-source generator, gLLM, is trained to transform raw documents into cohesive dialogues, enabling effective SFT data generation and downstream fine-tuning. Across diverse domains, CoD-based data improves informativeness, coherence, fidelity, and content coverage, demonstrating a practical path to knowledge-aware, long-horizon instruction tuning.

Abstract

Instruction tuning is an effective technique for aligning the outputs of large language models (LLMs) with human preference. However, how to generate seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages Chain of Dialogue (CoD) logic to guide LLMs in generating knowledge-intensive multi-turn dialogues for instruction tuning. By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark, K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese). Our approach first decides the logic flow of the current dialogue and then prompts LLMs to produce key phrases for sourcing relevant response content. This methodology enables the creation of the GINSTRUCT instruction dataset, which retains raw document knowledge within dialogue-style interactions. Using this dataset, we fine-tune GLLM, a model designed to transform raw documents into structured multi-turn dialogues, thereby injecting comprehensive domain knowledge into the SFT model for enhanced instruction tuning. This work signifies a stride towards refining the adaptability and effectiveness of LLMs in processing and generating more accurate, contextually nuanced responses across various fields.
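
The two-step procedure described in the abstract (decide the logic flow, then produce key phrases used to source the response) can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `TASK=` tags, the separate question and answer calls, and the names `chain_of_dialogue`, `build_history`, and `stub_llm` are all assumptions made for illustration, and `llm` stands for any text-in, text-out model call.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


def build_history(dialogue: List[Turn]) -> str:
    """Render earlier turns as plain text so each prompt sees the dialogue so far."""
    return "\n".join(
        f"User: {t['question']}\nAssistant: {t['answer']}" for t in dialogue
    )


def chain_of_dialogue(
    document: str, llm: Callable[[str], str], num_turns: int = 3
) -> List[Turn]:
    """Hypothetical CoD-style loop: logic flow -> key phrases -> question/answer."""
    dialogue: List[Turn] = []
    for _ in range(num_turns):
        context = (
            f"Document:\n{document}\n\n"
            f"Dialogue so far:\n{build_history(dialogue)}\n\n"
        )
        # Step 1: decide the logic flow of the current turn.
        logic = llm(context + "TASK=logic: Decide the logic flow of the next turn.")
        # Step 2: ask for key phrases that point at the relevant document content.
        phrases = llm(
            context
            + f"Logic flow: {logic}\n"
            + "TASK=phrases: List key phrases from the document to draw on."
        )
        # Step 3 (assumed split): write the question, then a grounded answer.
        question = llm(
            context
            + f"Logic flow: {logic}\nKey phrases: {phrases}\n"
            + "TASK=question: Write the next user question."
        )
        answer = llm(
            context
            + f"Question: {question}\nKey phrases: {phrases}\n"
            + "TASK=answer: Answer using only the document."
        )
        dialogue.append(
            {"logic": logic, "key_phrases": phrases,
             "question": question, "answer": answer}
        )
    return dialogue


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes which task the prompt requested."""
    task = prompt.rsplit("TASK=", 1)[1].split(":", 1)[0]
    return f"<{task}>"
```

Calling `chain_of_dialogue("Some raw text.", stub_llm, num_turns=2)` returns two turns, each carrying the chosen logic flow and key phrases alongside the question and answer. Replacing `stub_llm` with a real model call would yield dialogue-style records of the kind the abstract describes for instruction tuning.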

Paper Structure

This paper contains 37 sections, 3 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Comparison of standard domain-specific training with our proposed R2S.
  • Figure 2: Framework of R2S.
  • Figure 3: Comparison of dialogue responses generated by direct models and R2S.