Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai

Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat

TL;DR

The experimental results show that the best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions.

Abstract

We present a synthetic data approach for instruction-tuning large language models (LLMs) for low-resource languages in a data-efficient manner, specifically focusing on Thai. We identify three key properties that contribute to the effectiveness of instruction-tuning datasets: fluency, diversity, and cultural context. We propose a seed-data-free framework for generating synthetic instruction-tuning data that incorporates these essential properties. Our framework employs an LLM to generate diverse topics, retrieve relevant contexts from Wikipedia, and create instructions for various tasks, such as question answering, summarization, and conversation. The experimental results show that our best-performing synthetic dataset, which incorporates all three key properties, achieves competitive performance using only 5,000 instructions when compared to state-of-the-art Thai LLMs trained on hundreds of thousands of instructions. Our code and dataset are publicly available at https://github.com/parinzee/seed-free-synthetic-instruct.
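The three stages named in the abstract (topic generation, Wikipedia context retrieval, instruction creation) can be pictured with a minimal sketch. This is not the authors' implementation: the function names, prompt wording, round-robin task assignment, and the `fake_llm` / `fake_retrieve` stand-ins are all hypothetical and exist only so the example runs without network access. The released repository linked above is the authoritative reference.

```python
from dataclasses import dataclass
from typing import Callable, List

# Task types named in the abstract.
TASKS = ("question answering", "summarization", "conversation")


@dataclass
class Example:
    topic: str
    task: str
    context: str
    instruction: str


def generate_dataset(
    llm: Callable[[str], str],       # hypothetical: prompt in, text out
    retrieve: Callable[[str], str],  # hypothetical: topic in, passage out
    n_topics: int,
) -> List[Example]:
    # Stage 1: ask the LLM for topics (prompt wording is an assumption).
    raw = llm(f"List {n_topics} diverse topics related to Thai culture, one per line.")
    topics = [t.strip() for t in raw.splitlines() if t.strip()][:n_topics]

    examples = []
    for i, topic in enumerate(topics):
        # Stage 2: fetch a context passage for the topic.
        context = retrieve(topic)
        # Stage 3: have the LLM write an instruction for one task type.
        # Round-robin assignment is an illustrative choice, not from the paper.
        task = TASKS[i % len(TASKS)]
        instruction = llm(
            f"Using the context below, write a {task} instruction in Thai.\n\n"
            f"Context: {context}"
        )
        examples.append(Example(topic, task, context, instruction))
    return examples


# Offline stand-ins so the sketch runs as-is (both are hypothetical).
def fake_llm(prompt: str) -> str:
    if prompt.startswith("List"):
        return "Songkran\nMuay Thai\nTom Yum\n"
    return "INSTRUCTION<" + prompt.splitlines()[0] + ">"


def fake_retrieve(topic: str) -> str:
    return f"(Wikipedia passage about {topic})"


dataset = generate_dataset(fake_llm, fake_retrieve, n_topics=3)
```

Passing the LLM and the retriever in as callables keeps the control flow separate from any particular model or Wikipedia client, so either can be swapped without touching the loop.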

Paper Structure

This paper contains 21 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Our proposed framework for generating synthetic instruction-tuning datasets for low-resource languages from scratch with fluency, diversity, and cultural context.
  • Figure 2: Comparison of BERTScores of our best synthetic model and Typhoon-Instruct, averaged over both test sets. We also performed Wilcoxon rank-sum tests comparing F+ C+ D+ against Typhoon-Instruct for each task on both the Thai culture-specific and general test sets, and found that the differences were statistically significant (p < 0.05) for all tasks, with an average Wilcoxon statistic of -6.512 and an average p-value of 0.00073 across all comparisons.
  • Figure 3: Comparison of average generation lengths across all tasks and both benchmarks. A Wilcoxon rank-sum test was conducted to compare the generation lengths of our best model (F+ C+ D+) and Typhoon-Instruct. The results showed a statistically significant difference (W = -54.233, p < 0.00001), indicating that our model generates significantly shorter outputs than Typhoon-Instruct.
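
The negative statistics in the Figure 2 and Figure 3 captions are consistent with a rank-sum test reported as a normal-approximation z-score, where a negative value means the first sample tends to rank lower than the second. The sketch below shows that computation, assuming this is the form used; the source does not say which implementation produced the numbers. The function name and the sample lengths are hypothetical, and the variance omits a tie correction.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). z < 0 means values in x tend to rank below those in y.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)

    # Sum the ranks of x, giving tied values their average rank.
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (rank_sum_x - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p


# Hypothetical generation lengths, purely for illustration.
ours = [1, 2, 3]
baseline = [4, 5, 6]
z, p = rank_sum_test(ours, baseline)
```

With these toy inputs every value in `ours` ranks below every value in `baseline`, so `z` is negative; swapping the arguments flips the sign and leaves the p-value unchanged.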