
Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning

Pin-Jie Lin, Miaoran Zhang, Marius Mosbach, Dietrich Klakow

TL;DR

This paper tackles the problem of selecting beneficial intermediate tasks in intermediate-task transfer learning, revealing substantial transfer variance across source tasks and training seeds. It systematically evaluates four task-selection approaches within a unified framework and introduces a novel token-wise maximum inner product similarity (Max) for constructing task embeddings, achieving a peak nDCG of 82.5% on transferability prediction. The findings show that task embeddings derived from fine-tuned weights generally outperform embedding-free and text-embedding baselines (up to 3.96% relative gain in nDCG) but struggle on tasks requiring higher-level reasoning, which prompts a revisit of how the embeddings are constructed. The work also demonstrates that task type is not a reliable predictor and that training seeds heavily influence results, with practical implications for efficient intermediate-task search and robust transfer in resource-constrained NLP settings.

Abstract

Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights can better estimate task transferability, improving task prediction scores by 2.59% to 3.96%. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity predicts transferability better than averaging weights.
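
The task prediction scores above are ranking scores, and the TL;DR reports them as nDCG. The following is a minimal sketch of how nDCG could be computed for a single target task. It assumes a linear gain and non-negative relevance values derived from the observed transfer performance; the exact gain definition used in the paper is not given in this summary, and the function names `dcg` and `ndcg` and the example relevance values are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(predicted_ranking, true_relevance):
    """nDCG of a predicted source-task ranking for a single target task.

    predicted_ranking: source task names, most promising first.
    true_relevance: name -> non-negative relevance (assumption: derived
    from the transfer performance actually observed after fine-tuning).
    """
    gains = [true_relevance[name] for name in predicted_ranking]
    # The ideal ranking sorts the true relevances in descending order.
    ideal = sorted(true_relevance.values(), reverse=True)[: len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance values for three source tasks.
relevance = {"task_a": 3.0, "task_b": 2.0, "task_c": 1.0}
perfect = ndcg(["task_a", "task_b", "task_c"], relevance)
reversed_order = ndcg(["task_c", "task_b", "task_a"], relevance)
```

A selection method that ranks the truly best source tasks first scores close to 1, while one that ranks them last scores noticeably lower. The score reported for a method would then be the average over target tasks.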

Paper Structure (47 sections, 7 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Our proposed method, maximum inner product search, is based on pairwise token similarity. Left: Given a target task (e.g., CB), we obtain the maximum token-wise similarity scores between the target and the source tasks for each embedding position. Right: We select the source task with the highest mean of maximum similarity scores.
  • Figure 2: Left: Intermediate-task transfer learning first trains the model on the source task and then fine-tunes it on the target task. Right: Task selection is the process of identifying, for a given target task, the most beneficial source task for transfer by searching over a set of source tasks through their task embeddings. The selection process relies on a similarity metric to measure the transferability of tasks or datasets.
  • Figure 3: Relative transfer performance across ten downstream tasks with 390 intermediate-task trained models (13 source $\times$ 10 target tasks $\times$ 3 seeds). Each violin plot illustrates the distribution of performance on the x-axis, with each dot denoting the relative improvement or deterioration compared to the no-transfer baseline on the y-axis. Tasks are arranged in descending order of the training sample sizes.
  • Figure 4: Projecting prompt tokens of the best, 2nd-best, and worst-performing intermediate tasks for COPA and CB using t-SNE. We observe that prompt tokens from beneficial tasks are distributed more closely to the tokens of no-transfer prompt tuning.
  • Figure 5: Task prediction performance (average nDCG scores) of three types of task embeddings.
  • ...and 2 more figures
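
The selection rule described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each task embedding is a matrix of token vectors with shape (tokens, dim), uses raw inner products without normalization, and the names `max_token_similarity` and `rank_sources`, as well as the toy embeddings, are hypothetical.

```python
import numpy as np


def max_token_similarity(target: np.ndarray, source: np.ndarray) -> float:
    """Mean over target token positions of the maximum inner product
    with any source token.

    target: (n_target_tokens, dim), source: (n_source_tokens, dim).
    """
    scores = target @ source.T  # all pairwise token inner products
    return float(scores.max(axis=1).mean())


def rank_sources(target: np.ndarray, sources: dict) -> list:
    """Return source task names, highest similarity to the target first."""
    return sorted(
        sources,
        key=lambda name: max_token_similarity(target, sources[name]),
        reverse=True,
    )


# Toy 2-token, 2-dimensional embeddings (made-up values).
target = np.array([[1.0, 0.0], [0.0, 1.0]])
sources = {
    "source_a": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "source_b": np.array([[0.5, 0.5], [0.5, 0.5]]),
}
ranking = rank_sources(target, sources)
```

Taking the maximum per token position, rather than comparing averaged embeddings, lets a single well-matched source token contribute fully to the score, which is the contrast with weight averaging that the abstract draws.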