Unveiling the Generalization Power of Fine-Tuned Large Language Models

Haoran Yang; Yumeng Zhang; Jiaqi Xu; Hongyuan Lu; Pheng Ann Heng; Wai Lam

TL;DR

Integrating in-context learning into fine-tuning on generation tasks is observed to enhance the model’s generalization ability. The study aims to contribute insights into fine-tuning practices for LLMs.

Abstract

While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs' generalization ability are not fully understood. This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.
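The idea of integrating in-context learning during fine-tuning can be illustrated with a minimal sketch of how one training example might be assembled: a few demonstrations drawn from the training set are prepended to the target input, and the model is trained to produce the target output. This is only an illustration under stated assumptions. The function name `build_fticl_example`, the `Input:`/`Output:` template, and the toy records are hypothetical and are not taken from the paper, whose actual prompt formats are given in its appendix.

```python
import random


def build_fticl_example(target, pool, n_shots, rng):
    """Build one training example with n_shots in-context demonstrations.

    The "Input:/Output:" template is a hypothetical placeholder, not the
    prompt format used in the paper.
    """
    # Never use the target itself as one of its own demonstrations.
    candidates = [ex for ex in pool if ex is not target]
    demos = rng.sample(candidates, n_shots)

    # Each demonstration shows both the input and its reference output.
    blocks = [f"Input: {d['input']}\nOutput: {d['output']}" for d in demos]
    # The target input comes last, with its output left for the model.
    blocks.append(f"Input: {target['input']}\nOutput:")

    return {
        "prompt": "\n\n".join(blocks),
        "completion": " " + target["output"],
    }


# Toy records standing in for a generation dataset (hypothetical data).
pool = [
    {"input": "doc A", "output": "summary A"},
    {"input": "doc B", "output": "summary B"},
    {"input": "doc C", "output": "summary C"},
    {"input": "doc D", "output": "summary D"},
]

rng = random.Random(0)
example = build_fticl_example(pool[0], pool, n_shots=2, rng=rng)
vanilla = build_fticl_example(pool[0], pool, n_shots=0, rng=rng)
print(example["prompt"])
```

With `n_shots=0` the function reduces to a plain fine-tuning example with no demonstrations, which makes it easy to compare the two setups on the same data.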

Paper Structure (36 sections, 7 figures, 5 tables)

Introduction
Related Work
Large Language Models
Fine-tuning vs. In-Context Learning
Evaluation Design
Evaluation Taxonomy
Evaluation Benchmarks
Summary Generation
Question Generation
Sentiment Classification
Paraphrase Detection
Natural Language Inference
Experimental Setup
Models & Metrics
Training Details
...and 21 more sections

Figures (7)

Figure 1: In-domain dataset testing performance comparisons of baseline Llama-2 ($0$ training samples, orange line) and its fine-tuned variants ($2K, 4K, 6K$ training samples). Shot denotes the number of in-context examples. The caption for each subfigure refers to the test set. The corresponding training set can be found in Table \ref{tab:datasets}. The 0-shot results of the baseline Llama-2 model in (a), (b) and (c) are not presented because, without in-context examples, baseline models generally struggle to execute the tasks effectively, even when the prompt explicitly outlines the task requirements.
Figure 2: Out-of-domain dataset testing performance comparisons of baseline Llama-2 ($0$ training samples, orange line) and its fine-tuned variants ($2K, 4K, 6K$ training samples).
Figure 3: Cross-task performance comparisons of baseline Llama-2 ($0$ training samples, orange line) and models fine-tuned on other tasks. The caption for each subfigure refers to the test set. The legends denote the training data. The first row shows the results using the Prompt-1 (p1) format and the second row shows the results using the Prompt-2 (p2) format. The detailed prompt formats can be found in Appx. \ref{appx:promt}.
Figure 4: Same fine-tuning/test task type evaluation of FTICL with generation tasks. B$n$ represents the baseline Llama-2 model with $n$ in-context examples during inference. FC$n$ denotes the FTICL models fine-tuned with $n$ in-context examples. FT is the vanilla fine-tuned model without in-context learning. For FC$n$ and FT, we fine-tune with 2,000 samples, perform both 0-shot and few-shot evaluations, and report the results with the best performance.
Figure 5: Cross-task performance of FTICL with generation tasks. For the classification task evaluation, we also report the 0-shot performance (B0) for the baseline Llama-2.
...and 2 more figures
