Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

Shulin Liu; Chengcheng Xu; Hao Liu; Tinghao Yu; Tao Yang

TL;DR

The paper investigates whether large language models (LLMs) can serve as effective backbones for supervised fine-tuning on natural language understanding (NLU) tasks, focusing on Chinese short text matching. It systematically compares task modeling (generative vs. discriminative), prompt formats (concise vs. complex), and output formats (with vs. without Chain of Thought, CoT) on two datasets, LCQMC and BQ. The results show that a fine-tuned Chinese-enhanced LLM (CLLM-7B) can outperform fine-tuned BERT and even few-shot GPT-4, with the generative paradigm and CoT being particularly advantageous when training data is limited. Prompt design is less critical in supervised settings, and including CoT in the outputs yields consistent gains, which offers practical guidance for supervised LLM fine-tuning on NLU tasks. The findings may extend to NLU tasks beyond text matching; limitations include the dependence of the few-shot results on prompt choices and the focus on text matching alone.

Abstract

The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.
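To make the compared factors concrete, the following is a minimal sketch of how one sentence pair might be turned into a discriminative training example (pair plus class label) or a generative one (prompt plus target text, optionally with a CoT explanation). The prompt wording, the `yes`/`no` targets, the 0/1 label encoding, the placement of the CoT before the answer, and all names (`Pair`, `to_discriminative`, `to_generative`, `CONCISE_PROMPT`) are illustrative assumptions, not the paper's actual templates, which are in Chinese.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical concise prompt; the paper's real (Chinese) prompts are not reproduced here.
CONCISE_PROMPT = (
    "Do these two sentences have the same meaning?\n"
    "Sentence 1: {a}\nSentence 2: {b}\nAnswer:"
)


@dataclass
class Pair:
    text_a: str
    text_b: str
    label: int  # assumed encoding: 1 = matching, 0 = not matching


def to_discriminative(pair: Pair) -> dict:
    """Classification-style example: the raw pair plus an integer class label."""
    return {"text_a": pair.text_a, "text_b": pair.text_b, "label": pair.label}


def to_generative(pair: Pair, cot: Optional[str] = None) -> dict:
    """Generation-style example: a prompt plus the target text to be generated.

    If a CoT explanation is supplied, it is placed before the final answer
    (an assumption made for this sketch).
    """
    answer = "yes" if pair.label == 1 else "no"
    target = f"{cot}\nTherefore: {answer}" if cot else answer
    prompt = CONCISE_PROMPT.format(a=pair.text_a, b=pair.text_b)
    return {"prompt": prompt, "target": target}
```

In this framing, a discriminative model would be trained with a classification head on `label`, while a generative model would be trained to produce `target` token by token given `prompt`.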

Paper Structure (10 sections, 10 figures)

Introduction
Backgrounds
Task Definition
Datasets and Metrics
Experiments and Results
Generative vs. Discriminative Models
Concise vs. Complex Prompts
Effects of CoT
Conclusions
Appendix

Figures (10)

Figure 1: Model structures for modeling text matching as a generative task and as a discriminative task.
Figure 2: Results of models trained on 5,000, 20,000, and 80,000 samples, as well as on the entire training set.
Figure 3: The results of concise and complex prompts.
Figure 4: Illustration of how to obtain CoT via GPT-4. All original texts in this figure are in Chinese; they are translated here for ease of reading. The original version is shown in a figure in the Appendix.
Figure 5: Results of models trained with CoT.
...and 5 more figures
