Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Scott Barnett, Zac Brannelly, Stefanus Kurniawan, Sheng Wong

TL;DR

This paper questions the perceived universal benefits of fine-tuning LLMs when they are embedded in Retrieval-Augmented Generation pipelines for domain-specific QA. By evaluating Mistral, Llama2, and GPT-4 across BioASQ, Natural Questions, and Qasper with fine-tuning on 200, 500, and 1000 QA pairs, the study finds that base (untuned) models generally outperform their fine-tuned counterparts in both accuracy and completeness, with Qasper exhibiting the most pronounced declines. Increasing fine-tuning data did not yield improvements and often reduced performance, suggesting that fine-tuning within RAG can harm data extraction and contextual integration. These results highlight the need for rigorous validation before deploying fine-tuning for domain adaptation in RAG systems and motivate future work to identify when and how fine-tuning may be beneficial.

Abstract

Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: "To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case." This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.
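
The comparison described above, fine-tuned models against their baselines on accuracy and completeness, can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes each answer has already been judged with two boolean labels and that each metric is simply the fraction of answers labeled positively; the paper may define the metrics differently. The names `Judgement`, `score`, `delta`, and `configurations` are hypothetical, and the toy labels are made up for illustration, not results from the study.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence, Tuple

# Datasets and fine-tuning set sizes named in the summary above.
DATASETS = ("BioASQ", "Natural Questions", "Qasper")
TRAIN_SIZES = (200, 500, 1000)


@dataclass(frozen=True)
class Judgement:
    """Hypothetical per-answer labels (assumed to come from a human or automated judge)."""
    correct: bool
    complete: bool


def configurations() -> Iterator[Tuple[str, int]]:
    """Enumerate every (dataset, fine-tuning size) pair to evaluate for one model."""
    return product(DATASETS, TRAIN_SIZES)


def score(judgements: Sequence[Judgement]) -> Tuple[float, float]:
    """Return (accuracy, completeness) as fractions of positively labeled answers."""
    if not judgements:
        raise ValueError("need at least one judged answer")
    n = len(judgements)
    accuracy = sum(j.correct for j in judgements) / n
    completeness = sum(j.complete for j in judgements) / n
    return accuracy, completeness


def delta(baseline: Sequence[Judgement], tuned: Sequence[Judgement]) -> Tuple[float, float]:
    """Fine-tuned minus baseline; negative values mean fine-tuning hurt."""
    b_acc, b_comp = score(baseline)
    t_acc, t_comp = score(tuned)
    return t_acc - b_acc, t_comp - b_comp


if __name__ == "__main__":
    # Toy labels for illustration only; these are NOT results from the paper.
    baseline = [Judgement(True, True), Judgement(True, False),
                Judgement(False, False), Judgement(True, True)]
    tuned = [Judgement(True, False), Judgement(False, False),
             Judgement(False, False), Judgement(True, True)]
    print(len(list(configurations())), "configurations per model")
    print("accuracy/completeness change:", delta(baseline, tuned))
```

Reporting the difference rather than the raw scores makes the direction of the effect explicit for each dataset and training-set size, which is the comparison the study's figures present.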

Paper Structure

This paper contains 11 sections and 4 figures.

Figures (4)

  • Figure 1: Comparisons of accuracy for fine-tuned Llama2 models and baseline models across three datasets.
  • Figure 2: Comparisons of accuracy for fine-tuned Mixtral models and baseline models across three datasets.
  • Figure 3: Comparisons of completeness for fine-tuned Llama2 models and baseline models across three datasets.
  • Figure 4: Comparisons of completeness for fine-tuned Mixtral models and baseline models across three datasets.