
Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning

Yingqian Min, Kun Zhou, Dawei Gao, Wayne Xin Zhao, He Hu, Yaliang Li

TL;DR

Data-CUBE tackles interference in multi-task instruction-based sentence representation learning by introducing a two-level data curriculum. The method uses simulated annealing to solve a traveling salesman problem for optimal task ordering and sorts instances within each task from easy to difficult using a discriminability score, both aimed at reducing cross-task and cross-instance conflicts. Experiments on 28 downstream tasks within MTEB show that Data-CUBE yields consistent performance gains with less data and smaller batches than many baselines. The approach is model- and data-agnostic, scalable, and accelerates convergence while mitigating underfitting across diverse tasks.

Abstract

Recently, multi-task instruction tuning has been applied to sentence representation learning. It enables a model to generate task-specific representations under the guidance of a task instruction, and it generalizes well to new tasks. However, these methods mostly neglect potential interference across different tasks and instances, which may affect the training and convergence of the model. To address this, we propose a data curriculum method, Data-CUBE, that arranges the order of all the multi-task training data to minimize the interference risks from both views. At the task level, we aim to find the task order that minimizes the total cross-task interference risk, which is exactly the traveling salesman problem; we therefore use a simulated annealing algorithm to find a solution. At the instance level, we measure the difficulty of all instances per task, then divide them into easy-to-difficult mini-batches for training. Experiments on MTEB sentence representation evaluation tasks show that our approach can boost the performance of state-of-the-art methods. Our code and data are publicly available at https://github.com/RUCAIBox/Data-CUBE.
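
The task-level step described in the abstract can be illustrated with a minimal simulated annealing sketch. It is not the authors' implementation (see their repository for that). It assumes a hypothetical symmetric matrix `risk[a][b]` holding the cross-task interference risk between tasks `a` and `b`, treats the curriculum as an open path rather than a closed tour, and uses segment reversal with geometric cooling as the move and schedule. The function names and default parameters are illustrative choices, not taken from the paper.

```python
import math
import random


def path_cost(order, risk):
    """Total risk summed over consecutive task pairs in the order."""
    return sum(risk[a][b] for a, b in zip(order, order[1:]))


def anneal_task_order(risk, steps=5000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for a low-risk task order with simulated annealing.

    `risk` is a hypothetical n x n matrix (n >= 2) of pairwise
    interference risks; how it is estimated is outside this sketch.
    """
    rng = random.Random(seed)
    n = len(risk)
    order = list(range(n))
    rng.shuffle(order)
    cost = path_cost(order, risk)
    best_order, best_cost = order[:], cost

    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))

        # Propose a neighbor by reversing a random segment of the order.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        delta = path_cost(candidate, risk) - cost

        # Always accept improvements; accept worse orders with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            order, cost = candidate, cost + delta
            if cost < best_cost:
                best_order, best_cost = order[:], cost

    return best_order, best_cost
```

Calling `anneal_task_order(risk)` returns the best order found and its total risk. Accepting occasional worse orders at high temperature is what lets the search escape local minima that a purely greedy ordering would get stuck in.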

Paper Structure (30 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: (a) Example of task and instance interference. The distance reflects task similarity, and the shades of orange represent the difficulty level. (b) The underfitting degrees of all training tasks. We categorize all tasks into three degrees: severe (>80%), moderate (>50% but <80%), and mild (<50%), according to the ratio of instances whose positives and negatives are not clearly distinguished (margin < 0.05).
  • Figure 2: The proportion of underfitting instances within different tasks. We compare INSTRUCTOR with INSTRUCTOR fine-tuned using Data-CUBE.
  • Figure 3: Illustration of Data-CUBE: the task-level curriculum rearranges the task order from similar to dissimilar using simulated annealing, and the instance-level curriculum reorders the instances within each task from easy to difficult.
  • Figure 4: Performance variation curve on the STS tasks during the training process.
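
The underfitting categories in Figure 1(b) and the easy-to-difficult ordering in Figure 3 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: it assumes each instance has a margin defined as the similarity to its positive minus the similarity to its negative, and it treats a larger margin as an easier instance. The 0.05 threshold and the 50%/80% cut-offs come from the Figure 1 caption; the handling of ratios exactly at 50% or 80%, and all function names, are assumptions.

```python
def underfitting_degree(margins, threshold=0.05):
    """Return (ratio, label) for one task.

    ratio = share of instances whose margin is below `threshold`,
    i.e. whose positive and negative are not clearly distinguished.
    """
    ratio = sum(m < threshold for m in margins) / len(margins)
    # Caption boundaries: severe (>80%), moderate (>50%), mild otherwise.
    # Assigning exact 50% / 80% to the lower category is an assumption.
    if ratio > 0.8:
        label = "severe"
    elif ratio > 0.5:
        label = "moderate"
    else:
        label = "mild"
    return ratio, label


def easy_to_difficult_batches(instances, margins, batch_size):
    """Sort instances by decreasing margin (assumed easy first),
    then cut the sorted list into consecutive mini-batches."""
    ranked = sorted(range(len(instances)), key=lambda i: -margins[i])
    ordered = [instances[i] for i in ranked]
    return [ordered[k:k + batch_size]
            for k in range(0, len(ordered), batch_size)]
```

For example, a task with margins `[0.7, 0.02, 0.01, 0.3, 0.04]` has three of five instances below the threshold, so it would be labeled moderate under these assumptions.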