Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance
Yue Zhang, Leyang Cui, Deng Cai, Xinting Huang, Tao Fang, Wei Bi
TL;DR
The paper investigates whether open-source LLaMA models can be effectively tuned for a constrained writing-assistant scenario. The authors assemble a seven-task benchmark and 60k writing-instruction examples, combine them with 52k generic Alpaca examples, and show that instruction tuning substantially improves performance, with scenario-specific data adding further gains on some tasks. Small LLaMA variants can outperform larger off-the-shelf models in this vertical, although they do not reach task-specific state of the art on every task, and they carry notable computational and deployment costs as well as a risk of hallucination. The work highlights practical trade-offs in deploying task-focused LLMs and argues for weighing data and cost carefully before adopting an LLM for a single task.
Abstract
Proprietary Large Language Models (LLMs), such as ChatGPT, have garnered significant attention due to their exceptional capabilities in handling a diverse range of tasks. Recent studies demonstrate that open-sourced smaller foundational models, such as 7B-size LLaMA, can also display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data. In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following, and explore whether LLMs can be beneficial and further improved for such targeted scenarios. We choose the writing-assistant scenario as the testbed, which includes seven writing tasks. We collect training data for these tasks, reframe them in an instruction-following format, and subsequently refine the LLM, specifically LLaMA, via instruction tuning. Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks. We also conduct more experiments and analyses to offer insights for future work on effectively fine-tuning LLaMA for specific scenarios. Finally, we initiate a discussion regarding the necessity of employing LLMs for only one targeted task, taking into account the efforts required for tuning and the resources consumed during deployment.
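As a rough illustration of what reframing task data "in an instruction-following format" might look like, the sketch below wraps a (source, target) pair in an Alpaca-style record with `instruction`, `input`, and `output` fields and mixes it with generic instruction data. The task names, the instruction wording, and the helper names `to_instruction_record` and `mix` are assumptions made for this example; they are not taken from the paper.

```python
import random

# Hypothetical instruction templates; the wording is an assumption,
# not the prompts used by the authors.
TEMPLATES = {
    "grammaticality": "Fix the grammatical errors in the following text.",
    "simplification": "Rewrite the following text so that it is easier to read.",
}


def to_instruction_record(task, source, target):
    """Wrap one (source, target) pair in an Alpaca-style instruction record."""
    return {"instruction": TEMPLATES[task], "input": source, "output": target}


def mix(writing_records, generic_records, seed=0):
    """Concatenate scenario-specific and generic records, then shuffle them."""
    combined = list(writing_records) + list(generic_records)
    random.Random(seed).shuffle(combined)
    return combined


# Toy data for illustration only.
writing = [
    to_instruction_record(
        "grammaticality", "She go to school.", "She goes to school."
    )
]
generic = [
    {"instruction": "Name a primary color.", "input": "", "output": "Red."}
]
train_set = mix(writing, generic)
```

Keeping both data sources in one record schema means a single fine-tuning pipeline can consume them, and the ratio of writing to generic data can be varied simply by changing which records are passed to `mix`.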
