Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching
Shulin Liu, Chengcheng Xu, Hao Liu, Tinghao Yu, Tao Yang
TL;DR
The paper investigates whether large language models can serve as effective backbones for supervised fine-tuning on natural language understanding (NLU), focusing on Chinese short text matching. It systematically compares task modeling (generative vs. discriminative), prompt formats (concise vs. complex), and output formats (with vs. without Chain of Thought, CoT) on two datasets, LCQMC and BQ. The results show that a fine-tuned Chinese-enhanced LLM (CLLM-7B) can outperform fine-tuned BERT and even few-shot GPT-4, with the generative paradigm and CoT providing particular advantages when training data is limited. Prompt design is less critical in supervised settings, while including CoT in the outputs yields consistent gains; together these offer practical guidance for supervised LLM fine-tuning on NLU tasks. The findings may extend to NLU tasks beyond text matching, though the study is limited by its reliance on specific prompt choices in the few-shot setting and by its focus on text matching alone.
Abstract
The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.
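To make the three factors concrete, the following is a minimal sketch of how a single sentence pair might be turned into training samples under each setting. It is illustrative only: the prompt wording, the `Yes`/`No` answer tokens, the `rationale` field, and all function names are assumptions made for this example, not the templates or code used in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchingExample:
    text_a: str
    text_b: str
    label: int                       # 1 = same meaning, 0 = different meaning
    rationale: Optional[str] = None  # hypothetical CoT explanation, if available


def build_prompt(ex: MatchingExample, style: str = "concise") -> str:
    """Render the pair as a prompt. Both templates are assumed wordings."""
    if style == "concise":
        return f"Sentence 1: {ex.text_a}\nSentence 2: {ex.text_b}\nSame meaning?"
    if style == "complex":
        return (
            "You are given two short sentences. Decide whether they express "
            "the same meaning and answer Yes or No.\n"
            f"Sentence 1: {ex.text_a}\nSentence 2: {ex.text_b}\nAnswer:"
        )
    raise ValueError(f"unknown prompt style: {style}")


def build_generative_sample(ex: MatchingExample, style: str = "concise",
                            use_cot: bool = False) -> dict:
    """Generative modeling: the model is trained to emit the label as text,
    optionally preceded by a Chain-of-Thought rationale."""
    answer = "Yes" if ex.label == 1 else "No"
    if use_cot and ex.rationale:
        target = f"{ex.rationale} Answer: {answer}"
    else:
        target = answer
    return {"prompt": build_prompt(ex, style), "target": target}


def build_discriminative_sample(ex: MatchingExample) -> dict:
    """Discriminative modeling: the pair is encoded and a classification head
    predicts the label. Plain concatenation is an assumption of this sketch."""
    return {"input": f"{ex.text_a} [SEP] {ex.text_b}", "label": ex.label}


example = MatchingExample(
    text_a="How do I reset my password?",
    text_b="What is the way to change my password?",
    label=1,
    rationale="Both sentences ask how to set a new password.",
)
generative_plain = build_generative_sample(example)
generative_cot = build_generative_sample(example, style="complex", use_cot=True)
discriminative = build_discriminative_sample(example)
```

In this sketch the same example yields a text target for the generative setting and an integer label for the discriminative one, while the prompt style and the CoT flag vary independently, mirroring the three factors compared in the study.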
