Fine-tuning Large Language Models with Sequential Instructions

Hanxu Hu, Simon Yu, Pinzhen Chen, Edoardo M. Ponti

TL;DR

This paper argues that standard instruction tuning, which typically handles single-step prompts, underperforms on queries requiring multiple interrelated tasks. It introduces sequential instruction tuning (SIT), with both manual (translate-then-predict, caption-then-answer) and automatic Seq-Instruct data augmentation pipelines, to train LLMs to decompose and follow multi-step instructions. A new SeqEval benchmark assesses instruction-following across chained tasks, while extensive experiments show SIT improves coding, maths, and open-ended generation, and generalizes across models and datasets. The work advances the capacity of open-source LLMs to tackle complex tasks and provides a practical framework for constructing and evaluating sequential instruction data.

Abstract

Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries with multiple instructions. This impairs their performance in complex problems whose solution consists of multiple intermediate tasks. Thus, we contend that part of the fine-tuning data mixture should be sequential, containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating interpretable intermediate tasks for multilingual and visual question answering: namely "translate then predict" and "caption then answer". Next, we automate this process by turning instructions in existing datasets (e.g., Alpaca and FlanCoT) into diverse and complex sequential instructions, making our method general-purpose. Models that underwent our sequential instruction tuning show improved results in coding, maths, and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method. We hope that our endeavours will open new research avenues on instruction tuning for complex tasks.
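
To make the "translate then predict" idea concrete, the following is a minimal sketch of how a single-task training example might be turned into a sequential one. The `Example` class, the `to_translate_then_predict` helper, and the prompt wording are all hypothetical illustrations, not the authors' data format or pipeline; the translation is assumed to be supplied from elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str  # task description shown to the model
    input: str        # task input, possibly in a non-English language
    output: str       # reference answer


def to_translate_then_predict(ex: Example, translation: str) -> Example:
    """Wrap a single-task example so the target first translates, then answers.

    Hypothetical helper: the prompt wording below is illustrative only.
    """
    # Prefix an intermediate translation step to the original instruction.
    instruction = (
        "First translate the input into English. Then, "
        + ex.instruction[:1].lower()
        + ex.instruction[1:]
    )
    # The target now holds both the intermediate result and the final answer.
    output = f"Translation: {translation}\nAnswer: {ex.output}"
    return Example(instruction, ex.input, output)


single = Example(
    instruction="Answer the question.",
    input="Quelle est la capitale de la France ?",
    output="Paris",
)
sequential = to_translate_then_predict(single, "What is the capital of France?")
print(sequential.instruction)
print(sequential.output)
```

The same wrapping pattern would apply to "caption then answer", with an image caption taking the place of the translation as the intermediate output.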

Paper Structure (50 sections, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Construction of sequential instruction data via manual and automatic processes.
  • Figure 2: VQAv2 and GQA results (accuracy, %) for InstructBLIP-Vicuna-7B prompting, IT, and SIT.
  • Figure 3: Quality scores and following rates on different iterations of SeqEval for Llama-3-8B fine-tuned with Alpaca or FlanCoT under IT or SIT. We also report WizardLM and GPT-3.5-Turbo as baselines, which represent alternative data augmentation methods and proprietary models respectively.
  • Figure 4: Prompt template for classifying the given instruction into one of the four Seq-Instruct options, where the variable ${instruction} is replaced by the query instruction on the fly.
  • Figure 5: Prompt template for classifying the given instruction into one of the four Seq-Instruct options, where the variable ${instruction} is replaced by the query instruction on the fly.
  • ...and 3 more figures
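
The captions for Figures 4 and 5 describe a prompt template whose `${instruction}` variable is filled in on the fly. A minimal sketch of that substitution step, using Python's standard `string.Template` (which happens to use the same `${name}` placeholder syntax), could look like this. The template wording, the option labels A to D, and the `build_prompt` helper are placeholders assumed for illustration; the paper's actual template text is not reproduced here.

```python
from string import Template

# Hypothetical template text; only the ${instruction} placeholder
# comes from the figure captions.
CLASSIFY_TEMPLATE = Template(
    "Classify the instruction below into one of four options (A, B, C, or D).\n"
    "Instruction: ${instruction}\n"
    "Option:"
)


def build_prompt(instruction: str) -> str:
    # substitute() raises KeyError if a placeholder is left unfilled,
    # which guards against sending an incomplete prompt.
    return CLASSIFY_TEMPLATE.substitute(instruction=instruction)


prompt = build_prompt("Summarise the text in one sentence.")
print(prompt)
```

Substituted values are inserted verbatim and not re-parsed, so an instruction that itself contains a `$` sign does not break the template.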