
I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Xuan Ren, Biao Wu, Lingqiao Liu

TL;DR

The paper examines why fine-tuning a target LLM on responses generated by LLMs often surpasses training on human-provided answers in reasoning tasks. It demonstrates that the model’s familiarity with LLM-style data, as reflected by lower perplexity, substantially contributes to this advantage, beyond mere detail richness. Through a series of ablations, style-transfer experiments, and a minimal-change hybrid approach, the authors show that familiarity improves in-domain and cross-task performance, and can be achieved without resorting to more powerful models. The findings suggest practical data-annotation strategies that exploit LLM-generated data to enhance fine-tuning efficiency and generalization, while outlining limitations related to transformation quality and task-specific challenges.

Abstract

This paper explores an intriguing observation: fine-tuning a large language model (LLM) with responses generated by an LLM often yields better results than using responses generated by humans, particularly in reasoning tasks. We conduct an in-depth investigation to understand why this occurs. Contrary to the common belief that this advantage is due to the more detailed nature of LLM-generated content, our study identifies another contributing factor: an LLM is inherently more "familiar" with LLM-generated responses. This familiarity is evidenced by lower perplexity before fine-tuning. We design a series of experiments to understand the impact of this "familiarity", and our conclusion reveals that it significantly impacts learning performance. Training with LLM-generated responses not only enhances performance but also helps maintain the model's capabilities in other reasoning tasks after fine-tuning on a specific task.
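
The abstract uses perplexity as its measure of "familiarity". As a rough illustration of that quantity, and not the authors' evaluation code, the sketch below computes perplexity as the exponential of the negative mean per-token log-probability. The function name `perplexity` and the per-token probabilities are hypothetical and chosen only for illustration; in practice the log-probabilities would come from scoring each response with the target model before fine-tuning.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities that a target model assigns to two
# responses to the same question. These numbers are made up for illustration
# and are not taken from the paper.
human_style = [math.log(p) for p in (0.05, 0.10, 0.02)]
llm_style = [math.log(p) for p in (0.40, 0.50, 0.25)]

print("human-style perplexity:", perplexity(human_style))
print("LLM-style perplexity:", perplexity(llm_style))
```

In this toy setup the response whose tokens receive higher probability has the lower perplexity, which is the sense in which the abstract calls a response more "familiar" to the model.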

Paper Structure (23 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: This figure shows training outcomes for different data generation methods, demonstrating that more details do not always improve performance.
  • Figure 2: Average Perplexity Comparison
  • Figure 3: Minimum Change Data Correction Examples
  • Figure 4: Minimum Change Prompt Example
  • Figure 5: Llama2 groundtruth style transfer failure example
  • ...and 1 more figure