Sample Design Engineering: An Empirical Study of What Makes Good Downstream Fine-Tuning Samples for LLMs

Biyang Guo, He Wang, Wenyilin Xiao, Hong Chen, Zhuxin Lee, Songqiao Han, Hailiang Huang

TL;DR

This paper defines Sample Design Engineering (SDE) as a systematic approach to improving downstream fine-tuning of LLMs by optimizing input, output, and reasoning designs. It conducts in-domain and out-of-domain experiments across six open-source 7B LLMs on multi-aspect sentiment analysis (MASA) tasks, revealing robust patterns such as the effect of instruction placement and the task-dependent impact of chain-of-thought (CoT) reasoning. The authors propose ES-SDE, an integrated strategy combining the best-performing options (Inst-first, No-MI, Lines, PU, TxtLabel), and validate its superiority over heuristic designs on Nested-NER, Event Detection, and MASA. They further analyze the relationship between prompt engineering (PE) and sample design, showing that effective PE does not reliably predict SDE success, and discuss limitations and future research directions.
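
To make the five ES-SDE options concrete, the following Python sketch assembles one MASA training sample. The option semantics are an interpretation of the option names, not something stated in this summary: Inst-first is taken to mean the instruction precedes the input text, No-MI to mean the loss is not computed on input tokens, Lines to mean one aspect per output line, PU to mean a placeholder is emitted for unmentioned aspects, and TxtLabel to mean sentiment labels are words rather than numbers. All function names, the `[P]` placeholder string, and the example review are hypothetical and are not taken from the authors' repository.

```python
from typing import Dict, List

# Hypothetical placeholder string for aspects the text does not mention (PU).
PLACEHOLDER = "[P]"


def build_sample(
    instruction: str,
    text: str,
    aspects: List[str],
    sentiments: Dict[str, str],
) -> Dict[str, str]:
    """Assemble one fine-tuning sample under the assumed ES-SDE options."""
    # Inst-first (assumed meaning): the instruction comes before the input text.
    prompt = f"{instruction}\n{text}\n"

    # Lines: one aspect per output line.
    # PU: every aspect is listed; unmentioned ones get the placeholder.
    # TxtLabel: labels are words such as "positive", not numeric codes.
    lines = [f"{aspect}: {sentiments.get(aspect, PLACEHOLDER)}" for aspect in aspects]
    completion = "\n".join(lines)
    return {"prompt": prompt, "completion": completion}


def loss_mask(prompt_tokens: List[str], completion_tokens: List[str]) -> List[int]:
    """No-MI (assumed meaning): compute the loss on output tokens only.

    Returns 0 for each prompt token and 1 for each completion token.
    """
    return [0] * len(prompt_tokens) + [1] * len(completion_tokens)


if __name__ == "__main__":
    sample = build_sample(
        instruction="Give the sentiment of each aspect of the review.",
        text="Great food, but the service was slow.",
        aspects=["food", "service", "price"],
        sentiments={"food": "positive", "service": "negative"},
    )
    print(sample["prompt"] + sample["completion"])
```

Listing every aspect in a fixed order, with a placeholder for the unmentioned ones, gives each sample the same output shape. Whether that is why the option performs well is a question for the paper's experiments, not for this sketch.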

Abstract

In the burgeoning field of Large Language Models (LLMs) like ChatGPT and LLaMA, Prompt Engineering (PE) is renowned for boosting zero-shot or in-context learning (ICL) through prompt modifications. Yet sample design for downstream fine-tuning, which is crucial for task-specific LLM adaptation, is largely unexplored. This paper introduces Sample Design Engineering (SDE), a methodical approach to enhancing LLMs' post-tuning performance by refining input, output, and reasoning designs. We conduct a series of in-domain (ID) and out-of-domain (OOD) experiments to assess the impact of various design options on LLMs' downstream performance, revealing several intriguing patterns that hold consistently across different LLMs. Based on these insights, we propose an integrated SDE strategy, combining the most effective options, and validate its consistent superiority over heuristic sample designs in complex downstream tasks like multi-aspect sentiment analysis, event extraction, and nested entity recognition. Additionally, analyses of LLMs' inherent prompt/output perplexity, zero-shot, and ICL abilities illustrate that good PE strategies may not always translate to good SDE strategies. Code is available at https://github.com/beyondguo/LLM-Tuning.

Paper Structure

This paper contains 33 sections, 12 figures, and 14 tables.

Figures (12)

  • Figure 1: A simplified comparison between PE and our proposed SDE.
  • Figure 2: Typical SDE options to be considered when designing downstream-tuning samples, taking the MASA task as an example. $A_i$ means aspect $i$, $S_i$ means its sentiment label, and [P] refers to placeholder tokens.
  • Figure 3: An example for the MASA task.
  • Figure 4: Sentiment analysis performance ($\kappa$) of different SDE options. ID results are the average of D1→D1 and D2→D2; OOD results are averaged in the same way. The bars depict each method's relative improvement or degradation compared to the baseline, with each method differing from the baseline in only one option (colored in red). For detailed results on each task, see the per-model result tables (LLaMA2-Chat through Baichuan2-Base).
  • Figure 5: Format adherence performance, measured by parsing error rates (%). '*' means same option as above.
  • ...and 7 more figures