Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation

Xiyang Zhang, Yuanhe Tian, Hongzhi Wang, Yan Song

TL;DR

This work tackles catastrophic forgetting during domain adaptation of large language models by exploiting gradient geometry to separate domain learning from general knowledge preservation. It introduces Orthogonal Gradient Selection (OGS), a data-centric method that uses a lightweight Navigator to pre-screen training samples whose gradients are orthogonal to a general-knowledge anchor, paired with RL-driven selection and a PPO-Lagrangian optimization objective. The approach defines a Safety Subspace $\mathcal{S}_{\perp}$ via an anchor gradient $\mathbf{g}_{ref}$ and uses orthogonality and conflict metrics to guide sample selection, achieving Pareto improvements in domain performance while maintaining general reasoning and increasing training efficiency. Results across medical, legal, and financial domains show that OGS reduces forgetting, increases data efficiency, and maintains high throughput, making it a practical, scalable approach to safe continual learning in very large models.

Abstract

Fine-tuning large language models (LLMs) for specialized domains often necessitates a trade-off between acquiring domain expertise and retaining general reasoning capabilities, a phenomenon known as catastrophic forgetting. Existing remedies face a dichotomy: gradient surgery methods offer geometric safety but incur prohibitive computational costs via online projections, while efficient data selection approaches reduce overhead but remain blind to conflict-inducing gradient directions. In this paper, we propose Orthogonal Gradient Selection (OGS), a data-centric method that harmonizes domain performance, general capability retention, and training efficiency. OGS shifts the geometric insights of gradient projection from the optimizer to the data selection stage by treating data selection as a constrained decision-making process. By leveraging a lightweight Navigator model and reinforcement learning techniques, OGS dynamically identifies training samples whose gradients are orthogonal to a general-knowledge anchor. This approach ensures naturally safe updates for target models without modifying the optimizer or incurring runtime projection costs. Experiments across medical, legal, and financial domains demonstrate that OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or even enhancing performance on general tasks such as GSM8K.
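The core selection idea, keeping samples whose gradients are close to orthogonal to a general-knowledge anchor, can be illustrated with a minimal sketch. This is not the OGS procedure itself: OGS relies on a Navigator model and RL-driven selection, whereas the sketch below swaps in a fixed cosine threshold. The function names `cosine_to_anchor` and `select_near_orthogonal`, the tolerance `tol`, and the toy gradients are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_to_anchor(sample_grads, g_ref, eps=1e-12):
    """Cosine similarity between each per-sample gradient (one per row) and the anchor."""
    g_ref = g_ref / (np.linalg.norm(g_ref) + eps)
    norms = np.linalg.norm(sample_grads, axis=1) + eps
    return (sample_grads @ g_ref) / norms

def select_near_orthogonal(sample_grads, g_ref, tol=0.1):
    """Return indices of samples with |cos| <= tol, plus all cosine scores.

    Assumption: a fixed threshold stands in for the learned selection policy.
    """
    cos = cosine_to_anchor(sample_grads, g_ref)
    return np.flatnonzero(np.abs(cos) <= tol), cos

# Toy data (hypothetical): a 3-dimensional anchor and four per-sample gradients.
g_ref = np.array([1.0, 0.0, 0.0])
grads = np.array([
    [0.0, 1.0, 0.0],    # exactly orthogonal to the anchor
    [1.0, 1.0, 0.0],    # partially aligned with the anchor
    [-1.0, 0.0, 0.0],   # directly opposed to the anchor
    [0.05, 0.0, 1.0],   # nearly orthogonal
])

idx, cos = select_near_orthogonal(grads, g_ref, tol=0.1)
print("selected indices:", idx.tolist())
print("cosines:", np.round(cos, 3).tolist())
```

In this toy setting only the first and last samples pass the filter. Because the filtering happens before training, the optimizer itself is left unchanged, which matches the abstract's point about avoiding runtime projection costs.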

Paper Structure

This paper contains 42 sections, 3 theorems, 19 equations, 1 figure, 8 tables, and 1 algorithm.

Key Result

Theorem 4.1

The degradation of general capabilities after a domain update step $\Delta \theta = -\eta \nabla \mathcal{L}_{med}(\theta)$ is governed, to first order, by the inner product of the general-task and domain gradients.
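For intuition, a first-order Taylor expansion yields a relation of this kind. The notation is an assumption: $\mathcal{L}_{gen}$ stands for a loss measuring general capabilities, which this excerpt does not name, and the loss is taken to be smooth enough for the expansion to hold.

```latex
\mathcal{L}_{gen}(\theta + \Delta\theta) - \mathcal{L}_{gen}(\theta)
  = \nabla \mathcal{L}_{gen}(\theta)^{\top} \Delta\theta + O(\|\Delta\theta\|^{2})
  = -\eta \, \big\langle \nabla \mathcal{L}_{gen}(\theta),\, \nabla \mathcal{L}_{med}(\theta) \big\rangle + O(\eta^{2})
```

Under this reading, a zero inner product leaves the general loss unchanged to first order, a negative inner product increases it, and a positive one decreases it. This is the sense in which domain gradients orthogonal to the general-knowledge direction give safe updates.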

Figures (1)

  • Figure 1: Overview of the OGS pipeline, which comprises three stages (Navigator Probing, Geometric Selection, and Target Fine-tuning) and aims to select the best data subset.

Theorems & Definitions (5)

  • Theorem 4.1: Gradient Interference
  • Theorem 4.2: First-Order Optimality
  • Proposition 4.3: Asymptotic Efficiency
  • proof
  • proof