Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

Yuncheng Yang; Yulei Qin; Tong Wu; Zihan Xu; Gang Li; Pengcheng Guo; Hang Shao; Yuchen Shi; Ke Li; Xing Sun; Jie Yang; Yun Gu

TL;DR

We develop an efficient and scalable pipeline that produces task experts at low cost: K-shot data guide the selection of the most promising expert candidates and of task-relevant instructions, and diversity is insisted on throughout both selections.

Abstract

Cultivating expertise in large language models (LLMs) for tasks in specific areas often requires special-purpose tuning that calibrates model behavior toward the expected stable outputs. To avoid the huge cost of manually preparing instruction datasets and of training resources amounting to hundreds of hours, exploiting open knowledge, including a wealth of low-rank adaptation (LoRA) models and instruction datasets, serves as a good starting point. However, existing methods for model and data selection focus on general-purpose capabilities while neglecting the knowledge gap exposed in domain-specific deployment. In the present study, we propose to bridge this gap by introducing a few human-annotated samples (i.e., K-shot) for advancing the task expertise of LLMs with open knowledge. Specifically, we develop an efficient and scalable pipeline to cost-efficiently produce task experts, where K-shot data intervene in selecting the most promising expert candidates and the task-relevant instructions. A mixture-of-experts (MoE) system is built to make the best use of the individual-yet-complementary knowledge of multiple experts. We unveil the two keys to the success of an MoE system: 1) the abidance by K-shot, and 2) the insistence on diversity. For the former, we ensure that models that truly possess problem-solving abilities on K-shot are selected rather than blind guessers. In addition, during data selection, instructions that share task-relevant contexts with K-shot are prioritized. For the latter, we highlight the diversity of the constituent experts and that of the fine-tuning instructions throughout the model and data selection process. Extensive experimental results confirm the superiority of our approach over existing methods for utilizing open knowledge across various tasks. Our code will be available at https://github.com/Yaphabates/Rocket.
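
The data selection idea in the abstract (prioritize instructions that resemble the K-shot, while insisting on diversity) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that precomputed embedding vectors are available for both the K-shot samples and the open instructions, and it uses cosine similarity with a greedy near-duplicate filter as stand-ins for whatever similarity and diversity criteria the paper actually uses. The names `select_instructions`, `budget`, and `max_pairwise_sim` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def select_instructions(kshot_emb, cand_emb, budget, max_pairwise_sim=0.95):
    """Pick up to `budget` candidate indices that are close to the K-shot yet mutually diverse.

    Assumptions (not from the paper): embeddings are given, relevance is the
    maximum cosine similarity to any K-shot sample, and diversity is enforced
    by skipping candidates too similar to one that was already kept.
    """
    k = l2_normalize(np.asarray(kshot_emb, dtype=float))
    c = l2_normalize(np.asarray(cand_emb, dtype=float))

    # Relevance: how closely each candidate resembles its nearest K-shot sample.
    relevance = (c @ k.T).max(axis=1)

    kept = []
    for idx in np.argsort(-relevance):  # most relevant first
        if len(kept) == budget:
            break
        # Diversity: drop near-duplicates of instructions already selected.
        if kept and float((c[kept] @ c[idx]).max()) > max_pairwise_sim:
            continue
        kept.append(int(idx))
    return kept


if __name__ == "__main__":
    kshot = [[1.0, 0.0]]
    candidates = [[1.0, 0.0], [0.999, 0.01], [0.7, 0.7], [0.0, 1.0]]
    print(select_instructions(kshot, candidates, budget=2))
```

In the toy example, the second candidate is nearly identical to the first and is skipped, so the filter trades a small amount of relevance for coverage. The threshold value is arbitrary and would need tuning on real embeddings.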

Paper Structure (56 sections, 16 equations, 11 figures, 18 tables, 2 algorithms)

Introduction
Related Works
Efficient Fine-tuning of Parameters
LoRA
Prompt Tuning
Adapters
Mixture-of-Expert Models
Data Selection for Efficient Tuning
Quality
Diversity
Importance
Methodology
LoRA Bank Construction
Data Sources
Data Preprocessing
...and 41 more sections

Figures (11)

Figure 1: Given a few annotated samples from any task of interest ($K$-shot), we aim to advance LLMs in task expertise by leveraging open-source models and datasets. We propose an efficient and scalable pipeline to fully exploit the steering role of $K$-shot throughout model and data selection. Highly promising experts are first selected from the model bank by jointly considering their perplexity and performance on the $K$-shot and their intra-group diversity. These experts are initialized as one MoE system. Subsequently, we perform data augmentation by selecting diverse open instructions that resemble $K$-shot the most. Finally, we fine-tune the MoE system with both $K$-shot and the augmented data, which not only improves token-wise cooperation between experts but also integrates broad knowledge into the system. The ultimate task expert benefits from the complementary skills and knowledge of its constituent experts.
Figure 2: The performance of task-specific fine-tuning versus the reasoning perplexity of models in the bank. Preliminary experiments demonstrate that lower-performing models do not always lack domain-specific knowledge. Instead, their inability to follow instructions on the expected output format (e.g., answer choice) causes parsing failures during post-processing of the generated responses, which diminishes their measured performance. To avoid such a biased, partial measurement based solely on a metric such as exact-match accuracy, we propose to use the perplexity over the CoT rationales of answers as a superior, complementary proxy for model assessment. Accordingly, we evaluate whether the model possesses the task-specific knowledge by computing its perplexity score when modeling the reasoning process. Models that achieve lower reasoning perplexity are considered competent and tend to achieve greater improvement after fine-tuning than those with higher reasoning perplexity.
Figure 3: Expansion of the CoT rationales on $K$-shot instructions.
Figure 4: The overall pipeline of our $K$-shot guided model selection strategy. A comprehensive assessment in terms of perplexity, performance, and diversity is conducted on each model for expert selection. Given $K$-shot data, we evaluate a model's performance via exact-match accuracy on the directly inferred results. The reasoning perplexity is obtained by computing the perplexity of auto-regressive modeling of the CoT rationales towards the answers of $K$-shot. The top-$M$ ranked candidate models are first selected to save computation in the subsequent intra-group diversity step, where every $N$-tuple out of the $M$ candidates is evaluated. The $N$ models that share the lowest similarity in parameters (i.e., the largest group diversity) contribute to the initialization of an MoE system.
Figure 5: The architecture of our MoE system. It is implemented with LoRA modules, where the selected models from the LoRA bank are trained with an additional router that learns to assign different tokens to the responsible experts. Each token is routed to the top-$k$ activated experts, whose representations are multiplied by the corresponding routing weights for normalization.
...and 6 more figures
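
The model selection procedure described in the captions of Figures 2 and 4 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: perplexity is computed with the standard definition from per-token log-probabilities of a CoT rationale, the ranking combines perplexity and exact-match accuracy by a simple sum of ranks (the captions do not say how the two are combined), and parameter similarity is taken to be cosine similarity between flattened parameter vectors. All function names are hypothetical.

```python
import math
from itertools import combinations

import numpy as np


def reasoning_perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean log-probability of the rationale tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def top_m_candidates(perplexities, accuracies, m):
    """Rank models by perplexity (lower is better) and accuracy (higher is better).

    Assumption: the two criteria are merged by summing their ranks.
    """
    ppl = np.asarray(perplexities, dtype=float)
    acc = np.asarray(accuracies, dtype=float)
    ppl_rank = np.argsort(np.argsort(ppl))   # 0 = lowest perplexity
    acc_rank = np.argsort(np.argsort(-acc))  # 0 = highest accuracy
    score = ppl_rank + acc_rank
    return [int(i) for i in np.argsort(score, kind="stable")[:m]]


def most_diverse_group(flat_params, candidates, n):
    """Among all n-tuples of candidates (n >= 2), return the one with the lowest
    mean pairwise cosine similarity between flattened parameter vectors.

    Assumption: cosine similarity stands in for "similarity in parameters".
    """
    p = np.asarray(flat_params, dtype=float)
    p = p / np.clip(np.linalg.norm(p, axis=1, keepdims=True), 1e-12, None)

    def mean_similarity(group):
        pairs = list(combinations(group, 2))
        return sum(float(p[i] @ p[j]) for i, j in pairs) / len(pairs)

    # Exhaustive search over C(M, n) groups; restricting to the top-M keeps this tractable.
    return min(combinations(candidates, n), key=mean_similarity)


if __name__ == "__main__":
    candidates = top_m_candidates([2.0, 5.0, 3.0, 9.0], [0.8, 0.9, 0.7, 0.1], m=3)
    params = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.5, 0.5]]
    print(candidates, most_diverse_group(params, candidates, n=2))
```

The exhaustive search over $N$-tuples grows combinatorially with $M$, which is why the captions describe pruning to the top-$M$ candidates before the diversity step.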
