Labeling supervised fine-tuning data with the scaling law
Huanjun Kong
TL;DR
The paper addresses how to obtain high-quality SFT data under resource constraints by using an annotation process guided by the scaling law. Starting from 58k raw group-chat lines, it produces 2,302 annotated inquiries as alpaca-format prompts, along with a set of LoRA weights, and uses a multi-stage evaluation loop across Qwen baselines to calibrate annotation quality against model size. Fine-tuning Qwen models with LoRA yields substantial F1 gains across sizes (up to 85.58% for 32B and 61.93% for MoE-2.7B), although some configurations need learning-rate adjustments; the results are consistent with the scaling law. The approach is practical in constrained environments, and the tooling and data are open-sourced to support reproducibility and further exploration.
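Since the results are reported as F1 scores, a minimal sketch of the metric may help. It assumes a binary labeling task (for example, whether a chat line is an inquiry or not); the function name and the toy labels are hypothetical and are not taken from the released scripts.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = inquiry, 0 = not an inquiry.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Multiplying the result by 100 gives the percentage form used for the scores quoted above.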
Abstract
This paper introduces a multi-stage manual annotation method calibrated by the scaling law, offering a way to acquire high-quality Supervised Fine-Tuning (SFT) data in resource-constrained environments, such as those with few GPUs, limited GPT access, and restricted funding. We preprocessed 58k lines of authentic chat data and manually annotated 2.3k questions. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters. The best version improved the F1 score by 29.07 points, which confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) SFT training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling-law principle. The scripts, raw data in alpaca format, and experiment tracking records are open-sourced on GitHub (https://github.com/InternLM/HuixiangDou/tree/main/web/tools), HuggingFace (https://huggingface.co/tpoisonooo) and WandB (https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo). The users involved have authorized the use of their data with respect to privacy. The SFT data and license come from the ncnn contributors group.
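For readers unfamiliar with the alpaca format mentioned above, the following is a minimal sketch of how one annotated chat line could be stored. The three keys (`instruction`, `input`, `output`) follow the widely used Alpaca convention; the instruction wording, the sample messages, and the labels are hypothetical and are not copied from the released dataset.

```python
import json


def to_alpaca_record(instruction, input_text, output):
    """Build one SFT sample using the three standard Alpaca keys."""
    return {"instruction": instruction, "input": input_text, "output": output}


# Hypothetical instruction; the wording used in the released data may differ.
INSTRUCTION = "Decide whether the following chat message is a question that needs an answer."

# Hypothetical annotated samples: (chat line, human label).
annotated = [
    ("How do I build ncnn with Vulkan enabled?", "yes"),
    ("Thanks, that worked.", "no"),
]

records = [to_alpaca_record(INSTRUCTION, text, label) for text, label in annotated]

# Serialize as JSON Lines; ensure_ascii=False keeps non-ASCII chat text readable.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```

Keeping the label in `output` and the raw chat line in `input` means the same file can be reused across model sizes without changing the prompt template.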
