FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only

He Zhu, Junyou Su, Tianle Lun, Yicheng Tao, Wenjia Zhang, Zipei Fan, Guanhua Chen

TL;DR

FANNO is a fully autonomous, open-source framework for creating high-quality instruction-following data using only open LLMs, addressing the cost and scarcity of manually annotated datasets. It decomposes annotation into document pre-screening, seed instruction generation, instruction augmentation with UCB-based selection, and response generation with retrieval-augmented context, producing diverse and complex data without proprietary APIs. Empirical results on the Open LLM Leaderboard, AlpacaEval, and MT-Bench show that FANNO-tuned models perform competitively with, or better than, established baselines such as Alpaca-GPT4-Cleaned, while ablation studies confirm the value of each component. The work suggests a practical path toward democratizing access to high-quality instruction data and facilitating broader instruction-tuning research with open-source tools and models.

Abstract

Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and the AlpacaEval benchmark show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned.

Paper Structure (53 sections, 9 figures, 15 tables)

Figures (9)

  • Figure 1: Overview of the FANNO framework. (1) Document Pre-Screen: The unlabeled text data is processed with filters and a community detection algorithm. (2a) Seed Instruction Generation: FANNO generates seed instructions from the pre-screened documents, covering diverse task types and difficulty levels through a tag pool. (2b) Instruction Augmentation: New instructions are generated conditioned on the documents and on few-shot examples selected from the seed instructions with the UCB algorithm. (3) Response Generation: The responses to instructions are generated either directly by the teacher LLM or from the concatenation of the corresponding document and a retrieved document.
  • Figure 2: AlpacaEval Result
  • Figure 3: The verb-noun statistics grow with iteration
  • Figure 4: The instruction length (complexity) grows with iteration
  • Figure 5: Fanno Instruction Length Distribution
  • ...and 4 more figures
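
The Figure 1 caption mentions that few-shot examples for instruction augmentation are selected from the seed instructions with the UCB algorithm. As a rough illustration of that idea, here is a minimal sketch of standard UCB1 selection over a pool of seed instructions. It is not the authors' implementation: the class name `UCBSelector`, the exploration constant, and the notion of a per-instruction reward in [0, 1] (for example, a quality score of the instruction generated from that example) are all assumptions made for this example, since this section does not specify how FANNO defines the reward.

```python
import math


class UCBSelector:
    """UCB1 bandit where each arm is one seed instruction (hypothetical helper)."""

    def __init__(self, num_arms, exploration=math.sqrt(2)):
        self.counts = [0] * num_arms    # times each seed was used as a few-shot example
        self.totals = [0.0] * num_arms  # summed rewards observed for each seed
        self.exploration = exploration  # assumed constant; larger values explore more

    def score(self, arm):
        # Seeds that were never used get an infinite score so they are tried first.
        if self.counts[arm] == 0:
            return math.inf
        total_pulls = sum(self.counts)
        mean = self.totals[arm] / self.counts[arm]
        bonus = self.exploration * math.sqrt(math.log(total_pulls) / self.counts[arm])
        return mean + bonus

    def select(self, k):
        # Return the indices of the k seeds with the highest UCB score.
        ranked = sorted(range(len(self.counts)), key=self.score, reverse=True)
        return ranked[:k]

    def update(self, arm, reward):
        # Record the (assumed) reward in [0, 1] observed after using this seed.
        self.counts[arm] += 1
        self.totals[arm] += reward


# Usage: pick two seeds as few-shot examples, then feed back a placeholder reward.
selector = UCBSelector(num_arms=5)
chosen = selector.select(k=2)
for arm in chosen:
    selector.update(arm, reward=0.5)  # placeholder score, not real data
```

The exploration bonus shrinks as a seed is used more often, so the selector balances reusing seeds that produced good instructions against trying seeds that have rarely been sampled, which is one plausible way to keep the augmented instructions diverse.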