TSDS: Data Selection for Task-Specific Model Finetuning

Zifan Liu, Amin Karbasi, Theodoros Rekatsinas

TL;DR

TSDS tackles data selection for task-specific finetuning by casting the problem as a regularized optimal-transport optimization that aligns the selected data distribution to a small set of query examples while promoting diversity. It introduces two regularizers, $G_{\infty}$ and $G_{TV}$, which admit tractable closed-form solutions, and a KDE-based regularizer, $G_{\text{KDE}}$, which handles near-duplicates robustly and comes with theoretical guarantees for the resulting allocation. The framework yields practical, model-agnostic data-selection algorithms (KNN-Uniform and KNN-KDE) that use approximate nearest-neighbor search (e.g., FAISS) to scale to large corpora. Empirically, TSDS achieves strong gains on instruction tuning and domain-specific continued pretraining with as little as 1% of the candidate data, remains robust to duplicates, and reports feasible preprocessing times. The work thereby provides a principled, scalable approach to task-focused data curation that complements existing heuristic or performance-surrogate selection methods.

Abstract

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.
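The connection to nearest neighbor search can be illustrated with a small sketch. It is not the paper's algorithm: it assumes that each query example owns an equal share of the probability mass and spreads it uniformly over its K nearest candidates, and it uses brute-force Euclidean search in place of an approximate index such as FAISS. The function name `knn_uniform_weights`, the parameter `k`, and the toy embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def knn_uniform_weights(queries, candidates, k):
    """Return a sampling distribution over candidate indices.

    Assumption (not taken from the paper): each of the M queries owns 1/M of
    the total probability mass and splits it evenly among its k nearest
    candidates under Euclidean distance.
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    m = len(queries)
    weights = defaultdict(float)
    for q in queries:
        # Brute-force search; a large corpus would need an approximate index.
        order = sorted(range(len(candidates)),
                       key=lambda j: math.dist(q, candidates[j]))
        for j in order[:k]:
            weights[j] += 1.0 / (m * k)
    return dict(weights)


# Toy 2-D embeddings: two queries, five candidates (the last one is far away).
queries = [(0.0, 0.0), (10.0, 10.0)]
candidates = [(0.1, 0.0), (0.0, 0.2), (9.0, 10.0), (10.0, 9.5), (50.0, 50.0)]
weights = knn_uniform_weights(queries, candidates, k=2)
```

In this toy setting the four candidates near a query each receive weight 0.25, and the distant candidate receives none. Training data would then be sampled from `weights` according to the desired selection ratio.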

Paper Structure

This paper contains 30 sections, 3 theorems, 21 equations, 3 figures, 12 tables, 3 algorithms.

Key Result

Theorem 3.1

Given $\boldsymbol{d} \in \mathbb{R}_{\geq 0}^{M \times N}$ where $N > 1$, consider Problem eq:opt with $G(\boldsymbol{\gamma}) = G_\infty(\boldsymbol{\gamma}) = M \max_{i \in [M], j \in [N]} |\gamma_{ij} - \frac{1}{MN}|$. For all $i \in [M]$, let $j_{1}^{i}, \dots, j_{N}^{i}$ be a reordering of $[N]$ s
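The regularizer $G_\infty$ named in the theorem statement can be evaluated directly from its definition. The helper `g_inf` and the two toy transports below are hypothetical, and the sketch assumes a single query whose transport row sums to 1. It only shows that the penalty vanishes when mass is spread evenly and grows as mass concentrates on one candidate.

```python
def g_inf(gamma):
    """Evaluate M * max_{i,j} |gamma[i][j] - 1/(M*N)| for an M x N transport."""
    m, n = len(gamma), len(gamma[0])
    uniform = 1.0 / (m * n)
    return m * max(abs(g - uniform) for row in gamma for g in row)


# One query (M = 1) and five candidates (N = 5); both rows are toy values.
spread = [[0.2, 0.2, 0.2, 0.2, 0.2]]        # mass spread evenly
concentrated = [[1.0, 0.0, 0.0, 0.0, 0.0]]  # all mass on one candidate

penalty_spread = g_inf(spread)
penalty_concentrated = g_inf(concentrated)
```

Here the evenly spread transport has penalty 0, while the concentrated one has penalty $|1 - \frac{1}{5}| = 0.8$, which is why a larger weight on this term pushes the solution toward more diverse selections.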

Figures (3)

  • Figure 1: An example of the optimal probability transports under different regularization terms. We consider 1 query example $q$ and 5 candidates $x_1, \dots, x_{5}$ embedded in a 2-dimensional space. Assume that the candidates that form a cluster (i.e., $x_3, x_4, x_5$) have a density estimate of $\frac{3}{2}$ each and the others have a density estimate of $1$.
  • Figure 2: F1 scores of the downstream tasks under different duplication settings.
  • Figure 3: Performance of KNN-KDE when $\alpha$ varies. The error bar shows the standard deviation.
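The density estimates quoted in the Figure 1 caption can be turned into a small numeric illustration. The sketch below assumes, as one plausible reading of the KDE-based regularizer rather than the paper's stated rule, that a candidate's share of the query's mass is inversely proportional to its density estimate. The helper `inverse_density_shares` is hypothetical.

```python
from fractions import Fraction


def inverse_density_shares(densities):
    """Normalize 1/density so the shares sum to 1 (assumed weighting rule)."""
    inv = [1 / d for d in densities]
    total = sum(inv)
    return [w / total for w in inv]


# Densities from the Figure 1 caption: x1, x2 isolated; x3, x4, x5 clustered.
densities = [Fraction(1), Fraction(1),
             Fraction(3, 2), Fraction(3, 2), Fraction(3, 2)]
shares = inverse_density_shares(densities)
cluster_share = sum(shares[2:])
```

Under this assumption each isolated candidate receives 1/4 and each clustered candidate 1/6, so the cluster as a whole receives 1/2 rather than the 3/5 it would get from a uniform split over the five candidates.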

Theorems & Definitions (4)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem A.1
  • proof