Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning

Pin-Jie Lin; Miaoran Zhang; Marius Mosbach; Dietrich Klakow

TL;DR

This paper tackles the problem of selecting beneficial intermediate tasks in intermediate-task transfer learning and shows that transfer performance varies substantially across source tasks and training seeds. It systematically evaluates four task-selection approaches within a unified framework and introduces a token-wise maximum inner product similarity (Max) for comparing task embeddings, which reaches a peak nDCG of 82.5% on transferability prediction. Task embeddings derived from fine-tuned weights generally outperform embedding-free and text-embedding baselines (up to 3.96% relative gain in nDCG), but they struggle on tasks that require higher-level reasoning, which prompts a revisit of how the embeddings are constructed. The work also shows that task type is not a reliable predictor of transferability and that training seeds heavily influence results, with practical implications for efficient intermediate-task search and robust transfer in resource-constrained NLP settings.

Abstract

Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights better estimate task transferability, improving task prediction scores by 2.59% to 3.96%. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is more predictive of transferability than similarity computed from averaged weights.
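The token-wise similarity described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes each task embedding is a soft prompt stored as a NumPy array of shape (number of prompt tokens, hidden size), that similarity is a raw inner product with no normalization, and that the maximum is taken per target token position. The function names `max_similarity`, `mean_similarity`, and `rank_sources`, and the toy prompts, are hypothetical.

```python
import numpy as np


def max_similarity(target_prompt, source_prompt):
    """Mean, over target token positions, of the largest inner product
    between that target token and any source token (assumed formulation)."""
    scores = target_prompt @ source_prompt.T  # (n_target_tokens, n_source_tokens)
    return float(scores.max(axis=1).mean())


def mean_similarity(target_prompt, source_prompt):
    """Baseline for contrast: cosine similarity of the averaged prompt tokens."""
    t = target_prompt.mean(axis=0)
    s = source_prompt.mean(axis=0)
    return float(t @ s / (np.linalg.norm(t) * np.linalg.norm(s)))


def rank_sources(target_prompt, source_prompts, score_fn=max_similarity):
    """Return source task names ordered from most to least similar."""
    scored = {name: score_fn(target_prompt, p) for name, p in source_prompts.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy prompts with two tokens of dimension two (illustrative values only).
target = np.array([[1.0, 0.0], [0.0, 1.0]])
sources = {
    "src_a": np.array([[2.0, 0.0], [0.0, 2.0]]),
    "src_b": np.array([[-1.0, 0.0], [0.0, -1.0]]),
}
ranking = rank_sources(target, sources)
```

Here `ranking[0]` is the source task that would be picked for intermediate training. Passing `score_fn=mean_similarity` instead gives the averaged-weights comparison under the same assumptions.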

Paper Structure (47 sections, 7 equations, 7 figures, 8 tables)

Introduction
Related Work
Background
Intermediate-Task Transfer Learning
Parameter-Efficient Fine-Tuning via Soft Prompts
Intermediate-Task Selection Methods
Embedding-Free Methods
Random selection (Random)
Data size (Size)
Embedding Methods
Sentence embeddings (SEmb)
Prompt similarity (Feature)
Systematic Evaluation of Task Selection Methods
Experimental Setup
Datasets.
...and 32 more sections

Figures (7)

Figure 1: Our proposed method, maximum inner product search, is based on pairwise token similarity. Left: Given a target task (e.g., CB), we obtain the maximum token-wise similarity scores between the target and the source tasks for each embedding position. Right: We select the source task with the highest mean of maximum similarity scores.
Figure 2: Left: Intermediate-task transfer learning first trains on the source task and then fine-tunes on the target task. Right: Task selection is the process of identifying, for a given target task, the most beneficial task for transfer by searching over a set of source tasks using their task embeddings. The selection process relies on a similarity metric to measure the transferability of tasks or datasets.
Figure 3: Relative transfer performance across ten downstream tasks with 390 intermediate-task trained models (13 source $\times$ 10 target tasks $\times$ 3 seeds). Each violin plot illustrates the distribution of performance on the x-axis, with each dot denoting the relative improvement or deterioration compared to the no-transfer baseline on the y-axis. Tasks are arranged in descending order of the training sample sizes.
Figure 4: Projecting prompt tokens of the best, 2nd-best, and worst-performing intermediate tasks for COPA and CB using t-SNE. We observe that prompt tokens from beneficial tasks are distributed more closely to the tokens of no-transfer prompt tuning.
Figure 5: Task prediction performance (average nDCG scores) of three types of task embeddings.
...and 2 more figures
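The nDCG scores reported in the captions above measure how well a predicted ranking of source tasks matches their actual usefulness. The sketch below shows one common form of the metric. It assumes linear gains, a log2 position discount, and non-negative relevance values. The exact variant and relevance definition used in the paper are not stated here, and the task names and gain values are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(predicted_ranking, true_gain):
    """nDCG of a predicted best-first ranking of source tasks.

    predicted_ranking: list of task names, most promising first.
    true_gain: dict mapping task name to a non-negative relevance value,
               e.g. measured transfer improvement clipped at zero (assumption).
    """
    gains = [true_gain[name] for name in predicted_ranking]
    ideal_dcg = dcg(sorted(true_gain.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Illustrative relevance values only; not results from the paper.
true_gain = {"src_a": 3.0, "src_b": 2.0, "src_c": 1.0}
perfect = ndcg(["src_a", "src_b", "src_c"], true_gain)
reversed_score = ndcg(["src_c", "src_b", "src_a"], true_gain)
```

A ranking that orders the source tasks exactly by their true gain scores 1, and any ranking that places weaker tasks first scores lower, which is why the metric suits comparing task selection methods.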
