Does the Order of Fine-tuning Matter and Why?

Qihong Chen, Jiawei Li, Hyunjae Suh, Lianghao Jiang, Zheng Zhou, Jingze Chen, Jiri Gesi, Iftekhar Ahmed

TL;DR

This work investigates whether the order in which a model is fine-tuned on multiple intermediate software engineering (SE) tasks influences target-task performance. Using CodeBERT and CodeXGLUE SE tasks, it exhaustively evaluates all permutations of two to four tasks (60 chains) against baselines under 10-fold cross-validation, and finds that task ordering can change target-task performance by up to ~6% (gain) or ~4% (loss). The authors analyze explanatory factors in three groups: data characteristics (syntactic and semantic similarity, dataset size), task relationships (task affinity), and model behavior (probing and attention analysis). They also weigh time cost against performance to identify cost-effective orders. The findings give SE researchers and practitioners practical guidance on selecting intermediate-task sequences under resource constraints, and they motivate future work on SE-task similarity metrics and broader task sets.

Abstract

To improve performance on a target task, researchers have fine-tuned language models on an intermediate task before the target task of interest. However, previous work has focused on pre-trained language models and downstream tasks in Natural Language Processing (NLP) and has considered only one intermediate task. The effect of fine-tuning on multiple intermediate tasks, and of their ordering, on target-task performance has not been fully explored in Software Engineering. In this study, we present the first empirical study of the impact of task ordering on target-task performance. Experimental results show that task ordering affects target-task performance, with gains of up to 6% and losses of up to 4%. To explain this impact, we consider a variety of potential factors, including characteristics of the dataset (syntactic and semantic similarity analysis, dataset size), the model (probing task and attention analysis), and the task (task affinity analysis). Our study provides Software Engineering researchers and practitioners with insights into the effect of task orderings and into how to select an ordering that is cost-effective while achieving the best performance gain.
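
The count of 60 chains quoted in the TL;DR is consistent with ordering two, three, or four distinct tasks drawn from a pool of four, since 12 + 24 + 24 = 60. The following is a minimal sketch of that arithmetic only, not the authors' experimental code. The pool size of four is an assumption inferred from the count, and the task names are hypothetical placeholders.

```python
from itertools import permutations

# Hypothetical placeholder names: this summary does not list the actual tasks.
# A pool of four tasks is an assumption inferred from the reported 60 chains.
TASKS = ["task_a", "task_b", "task_c", "task_d"]


def enumerate_chains(tasks, min_len=2, max_len=4):
    """Return every ordered chain of distinct tasks with min_len..max_len tasks."""
    chains = []
    for length in range(min_len, max_len + 1):
        # permutations() yields ordered tuples without repeated elements,
        # so ("task_a", "task_b") and ("task_b", "task_a") are distinct chains.
        chains.extend(permutations(tasks, length))
    return chains


chains = enumerate_chains(TASKS)
counts = {n: sum(1 for c in chains if len(c) == n) for n in (2, 3, 4)}

print(counts)       # {2: 12, 3: 24, 4: 24}
print(len(chains))  # 60
```

Because every chain requires its own sequence of fine-tuning runs, and each is repeated across cross-validation folds, exhaustive enumeration grows quickly with the size of the task pool. This is one reason the time-cost trade-off between orderings matters.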

Paper Structure (23 sections, 2 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Single Intermediate Task Fine-tuning vs. Multiple Intermediate Task Fine-tuning
  • Figure 2: Overview of Factors Explored in this Study
  • Figure 3: Attention weight experimental results on syntax tokens
  • Figure 4: Attention weight experimental results on abstract syntax tree elements