Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning
Pin-Jie Lin, Miaoran Zhang, Marius Mosbach, Dietrich Klakow
TL;DR
This paper tackles the problem of selecting beneficial intermediate tasks in intermediate-task transfer learning, revealing substantial variance in transfer performance across source tasks and training seeds. It systematically evaluates four task-selection approaches within a unified framework and introduces a novel token-wise maximum inner product similarity (Max) for constructing task embeddings, achieving a peak NDCG of 82.5% on transferability prediction. The findings show that task embeddings derived from fine-tuned weights generally outperform embedding-free and text-embedding baselines (up to 3.96% relative gain in NDCG), but struggle on tasks requiring higher-level reasoning, prompting a revisit of how embeddings are constructed. The work also demonstrates that task type is not a reliable predictor of transferability and that training seeds heavily influence results, offering practical implications for efficient intermediate-task search and robust transfer in resource-constrained NLP settings.
Abstract
Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights estimate task transferability better, improving task prediction scores by 2.59% to 3.96%. Despite their strong performance, we observe that task embeddings do not consistently demonstrate superiority on tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is more predictive of transferability than averaged weights.
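To make the contrast between token-wise similarity and averaging concrete, the following is a minimal sketch and not the authors' implementation. It assumes each task is represented by a matrix of token-level vectors (one row per token); the function names, the row normalization, and the toy data are all hypothetical choices made for illustration. For each target token, the sketch takes its largest inner product with any source token and averages these maxima, and it compares that score with the cosine similarity of mean-pooled vectors.

```python
import numpy as np


def l2_normalize(tokens: np.ndarray) -> np.ndarray:
    """Scale each token vector (row) to unit length."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    return tokens / np.clip(norms, 1e-12, None)


def mean_pooled_similarity(source: np.ndarray, target: np.ndarray) -> float:
    """Baseline: cosine similarity between the averaged token vectors."""
    s = source.mean(axis=0)
    t = target.mean(axis=0)
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))


def max_inner_product_similarity(source: np.ndarray, target: np.ndarray) -> float:
    """Token-wise score: for each target token, take the largest inner
    product with any source token, then average over target tokens.
    (Assumed formulation; the paper's exact definition may differ.)"""
    scores = l2_normalize(target) @ l2_normalize(source).T  # shape (m, n)
    return float(scores.max(axis=1).mean())


# Toy token matrices (hypothetical data, 2 tokens x 2 dimensions).
target = np.array([[1.0, 0.0], [0.0, 1.0]])
source_a = np.array([[0.0, 1.0], [1.0, 0.0]])  # same tokens as target, reordered
source_b = np.array([[1.0, 1.0], [1.0, 1.0]])  # only the "average" direction

candidates = {"source_a": source_a, "source_b": source_b}

# Rank candidate source tasks by the token-wise score, best first.
ranking = sorted(
    candidates,
    key=lambda name: max_inner_product_similarity(candidates[name], target),
    reverse=True,
)
```

In this toy case both sources have the same mean-pooled similarity to the target, so averaging cannot separate them, while the token-wise score ranks source_a first because each target token has an exact match among its tokens. This illustrates only why token-level matching can carry information that averaging discards; it makes no claim about the magnitudes reported in the paper.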
