Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

TL;DR

This paper tackles two challenges of directly applying NTK to LLMs (its theoretical assumptions and its prohibitive computational cost) by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning and proposing a Jacobian-free approximation method.

Abstract

Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings suggest that such data can serve as auxiliary supervision for domain enhancement. This observation leads to our central research question: how can the most valuable auxiliary data be selected to maximize domain-specific performance, particularly when traditional methods are inapplicable because no large in-domain data pool or validation set is available? To address this, we propose NTK-Selector, a principled and efficient framework that uses neural tangent kernels (NTK) to select general-domain auxiliary data for enhancing domain-specific performance. Directly applying NTK to LLMs faces two challenges: its theoretical assumptions and its prohibitive computational cost. Our method addresses the first by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning, and the second by proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yielded only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching them with 9,000 auxiliary samples selected by NTK-Selector led to substantial gains of +8.7 and +5.1 points, corresponding to 10.9x and 5.7x improvements over the domain-only setting.
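
The improvement factors quoted in the abstract appear to be ratios of the two reported gains rather than ratios of absolute scores. As a quick check using only the numbers stated there:

$$\frac{8.7}{0.8} = 10.875 \approx 10.9 \quad \text{(Llama3-8B-Instruct)}, \qquad \frac{5.1}{0.9} \approx 5.67 \approx 5.7 \quad \text{(Qwen3-8B)}.$$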

Paper Structure

This paper contains 55 sections, 2 theorems, 27 equations, 13 figures, 5 tables, and 1 algorithm.

Key Result

Theorem 1

Given a model $f(\cdot; \theta_t)$ with NTK $\Theta (\cdot, \cdot; \theta_t)$ whose training dynamics are governed by the gradient flow $\dot{f}(\cdot;\theta_t) = -\eta \Theta(\cdot, \cdot; \theta_t) \gamma(t)$, where $\gamma (t)=\nabla_f \mathcal{L}(f(\cdot; \theta_t))$ and $\eta$ is the learning rate, … where $u(t) \coloneqq \int_0^t a^*(\tau)\, d\tau$, $a^*(t) = \frac{\langle \Theta (\cdot, \cdot; \the…
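
The gradient-flow premise of Theorem 1, $\dot{f} = -\eta \Theta \gamma$, can be illustrated in the simplest setting where it holds exactly: a linear model with squared loss, for which one gradient-descent step on the parameters changes the outputs by exactly $-\eta \Theta \gamma$. The sketch below is not the paper's method; the toy features, targets, and helper names (`ntk_matrix`, `gd_step`) are hypothetical and chosen only for illustration.

```python
# Sketch (assumption: linear model f(x; theta) = <theta, phi(x)>, squared loss).
# For this model the empirical NTK is the Gram matrix of per-example gradients,
# Theta[i][j] = <phi(x_i), phi(x_j)>, and it does not change during training.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ntk_matrix(features):
    """Empirical NTK of a linear model: Gram matrix of the feature vectors."""
    return [[dot(fi, fj) for fj in features] for fi in features]

def predict(theta, features):
    return [dot(theta, phi) for phi in features]

def gd_step(theta, features, targets, eta):
    """One gradient-descent step on L = 0.5 * sum_i (f_i - y_i)^2."""
    gamma = [f - y for f, y in zip(predict(theta, features), targets)]  # dL/df
    grad = [sum(g * phi[k] for g, phi in zip(gamma, features))
            for k in range(len(theta))]
    return [t - eta * g for t, g in zip(theta, grad)]

# Hypothetical toy data (not taken from the paper).
features = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]
targets = [1.0, -1.0]
theta = [0.1, -0.2, 0.3]
eta = 0.05

f_before = predict(theta, features)
gamma = [f - y for f, y in zip(f_before, targets)]
kernel = ntk_matrix(features)

# Function-space view: f <- f - eta * Theta @ gamma.
f_kernel = [f - eta * dot(row, gamma) for f, row in zip(f_before, kernel)]
# Parameter-space view: update theta, then run a forward pass.
f_params = predict(gd_step(theta, features, targets, eta), features)

# The two views agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(f_kernel, f_params))
print(max_gap)
```

For a nonlinear model such as an LLM the kernel generally changes during training, so the two views agree only approximately; this is why the stability of the kernel during LoRA fine-tuning matters for the paper's argument.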

Figures (13)

  • Figure 1: Performance of Llama3-8B-Instruct evaluated on medical, financial, legal, and psychological tasks. For each task, the 1K domain samples are augmented with 9K auxiliary samples selected from CoT Collection by Random, LESS, or NTK-Selector.
  • Figure 2: (a) Frobenius cosine similarity between NTK of Llama3-8B-Instruct and Qwen3-8B during LoRA-based instruction tuning on a financial sentiment analysis task. (b) Correlation between the exact NTK values and the Jacobian-free approximation across a diverse set of input pairs.
  • Figure 3: Accuracy on the TFNS task with 1K domain samples and varying numbers of auxiliary samples, where the total number of samples is enriched by $\times 2$, $\times 5$, $\times 10$, and $\times 20$.
  • Figure 4: Accuracy on the TFNS task with 100, 200, 500, 1K, and 2K domain samples, each enriched by $\times 2$, $\times 5$, $\times 10$, and $\times 20$.
  • Figure 5: Average performance of NTK-Selector with projection dimensions of 1024, 2048, 4096, and 8192.
  • ...and 8 more figures
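
The Frobenius cosine similarity named in the caption of Figure 2(a) is, in its standard form, $\langle A, B \rangle_F / (\|A\|_F \|B\|_F)$ for two kernel matrices $A$ and $B$. A minimal sketch of that quantity follows; the matrices are hypothetical stand-ins for kernels measured at different checkpoints, and which pairs of kernels the paper actually compares is not stated in this summary.

```python
import math

def frobenius_cosine(a, b):
    """Return <A, B>_F / (||A||_F * ||B||_F) for two equally shaped matrices."""
    inner = sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(x * x for row in b for x in row))
    return inner / (norm_a * norm_b)

# Hypothetical 2x2 kernel matrices (illustrative values only).
k_start = [[5.0, -1.5], [-1.5, 2.25]]
k_scaled = [[10.0, -3.0], [-3.0, 4.5]]   # same structure, twice the scale
k_drifted = [[5.0, 1.5], [1.5, 2.25]]    # off-diagonal sign flipped

# Rescaling a kernel leaves the similarity at 1 (up to rounding);
# a structural change lowers it.
sim_scaled = frobenius_cosine(k_start, k_scaled)
sim_drifted = frobenius_cosine(k_start, k_drifted)
print(sim_scaled, sim_drifted)
```

Because the measure is invariant to positive rescaling, a value near 1 indicates that the kernel keeps its structure even if its overall magnitude changes.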

Theorems & Definitions (6)

  • Definition 1: NTK-like
  • Theorem 1
  • Definition 2: Jacobian-free NTK Approximation
  • Theorem 2
  • proof
  • proof