PROFIT: A Specialized Optimizer for Deep Fine Tuning

Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen

TL;DR

PROFIT addresses catastrophic forgetting during fine-tuning by treating fine-tuning as proximal temporal multitask optimization. It introduces a modular optimizer wrapper that runs a brief reference optimization from a converged baseline, computes the displacement to the baseline, orthogonally projects the new-task gradient to align with the old equilibrium, and then applies the main update along this orthogonal direction. Theoretical results indicate PROFIT decreases old-task loss and maintains stable points inherited from traditional optimizers, with near-linear loss surfaces posing rare failure modes. Empirically, PROFIT improves performance across image classification, VTAB-1K, visual-language modeling, and autonomous driving motion prediction, often outperforming standard fine-tuning baselines while preserving old-task performance. The approach is lightweight to implement, modular, and readily integrable into existing training pipelines, offering a practical tool for robust fine-tuning.
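The orthogonal projection mentioned above can be illustrated with the standard vector-projection formula. This is a sketch under assumptions, not necessarily the exact rule used in the paper: $g$ stands for the new-task gradient and $\Delta$ for the displacement produced by the reference optimization (both symbols are introduced here for illustration).

$$
g_{\perp} \;=\; g - \frac{\langle g, \Delta\rangle}{\langle \Delta, \Delta\rangle}\,\Delta,
\qquad
\langle g_{\perp}, \Delta\rangle \;=\; \langle g, \Delta\rangle - \frac{\langle g, \Delta\rangle}{\langle \Delta, \Delta\rangle}\,\langle \Delta, \Delta\rangle \;=\; 0 .
$$

Under this assumption, the main update has no component along $\Delta$, and the result is the same whether the projection is taken with respect to $\Delta$ or $-\Delta$.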

Abstract

The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning models, there has been less scholarship around fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initializations, PROFIT takes the properties of a converged model into account explicitly to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms fine-tuning methods in various tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, which makes it easy to integrate directly into any training pipeline with minimal engineering effort.

Paper Structure

This paper contains 38 sections, 4 theorems, 5 figures, 14 tables, 1 algorithm.

Key Result

Theorem 3.1

(Correctness on old data) Take a model $\mathcal{M}(\mathbf{x}; \theta)$ converged on data $\mathcal{X}_{\text{old}}$ with loss $L_{\text{old}}$, which we would now like to fine-tune on data $\mathcal{X}_{\text{new}}$ with loss $L_{\text{new}}$. Suppose that $\mathbf{O}^{(\text{ref})}$ takes $\theta

Figures (5)

  • Figure 1: Schematic of PROFIT. Standard fine-tuning (middle) takes successive steps away from a good starting state $\theta_0$. PROFIT (right): (1) take $n_{\text{ref}}$ small reference steps with $O^{(\mathrm{ref})}$ to obtain a displaced state $\theta_1$; (2) compute the displacement $\Delta=\theta_1-\theta_0$; (3) orthogonalize the new-batch gradient $g:= \theta_2 - \theta_1$ to $-\Delta$; (4) restore $\theta\leftarrow\theta_0$ and then apply the main optimizer $O$ along the orthogonalized direction. Here, r denotes ‘reference’. See Alg. 1 for details.
  • Figure 2: Results for PROFIT on Waymo Open Motion Dataset. (a) Training error curves for FDE at 3s and 8s (top) and ADE at 3s and 8s (bottom). PROFIT outperforms both fine-tuning baselines by a sizable margin. (b) Visualizations of motion prediction outputs for both the baseline fine-tune model (top) and PROFIT (bottom). Trajectory ground truth is shown as a shaded bar and denser lines represent more confident predictions. PROFIT (bottom) produces more confident predictions that align better with the ground truth (shaded bar) compared to the baseline (top). Best viewed in color.
  • Figure 3: Toy example visualizations. The top row shows ground truth distributions for the original, new, and combined datasets, while the bottom row depicts model predictions for different training strategies: training on original data only, head-only fine-tuning, full model fine-tuning, and using the proposed PROFIT optimizer for full model fine-tuning. This highlights the PROFIT optimizer’s effectiveness in retaining old task knowledge while adapting to new tasks. Best viewed in color.
  • Figure 4: Training and validation losses for fine-tuning ViT-Small on CIFAR100.
  • Figure 5: We compare the baseline with PROFIT on an example from DriveLM. The model fine-tuned with PROFIT is able to perceive the traffic light and black sedan in the scene, while the baseline (fine-tuned with AdamW) does not detect the traffic light and hallucinates the presence of a white truck. Consequently, the baseline suggests running the red light, while our method follows traffic rules by staying stationary. Best viewed in color.
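The four steps in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: both $O^{(\mathrm{ref})}$ and $O$ are replaced by plain gradient descent, the projection is applied unconditionally, and the names `orthogonalize`, `profit_step`, `grad_fn`, `lr_ref`, and `lr_main` are hypothetical. The toy loss $\tfrac{1}{2}\lVert\theta - b\rVert^2$ is chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def orthogonalize(g, delta, eps=1e-12):
    """Return g with its component along delta removed (assumed projection rule)."""
    denom = float(np.dot(delta, delta))
    if denom < eps:  # no displacement, so there is nothing to project out
        return g.copy()
    return g - (np.dot(g, delta) / denom) * delta

def profit_step(theta0, grad_fn, ref_batches, main_batch, lr_ref=0.1, lr_main=0.1):
    """One PROFIT-style update, with plain gradient descent standing in for both optimizers."""
    # (1) Take reference steps from theta0 to reach a displaced state theta1.
    theta1 = theta0.copy()
    for batch in ref_batches:
        theta1 = theta1 - lr_ref * grad_fn(theta1, batch)
    # (2) Displacement produced by the reference steps.
    delta = theta1 - theta0
    # (3) Gradient on a new batch at theta1, made orthogonal to the displacement.
    g_orth = orthogonalize(grad_fn(theta1, main_batch), delta)
    # (4) Restore theta0 and apply the main update along the orthogonalized direction.
    return theta0 - lr_main * g_orth

def grad_fn(theta, batch):
    """Gradient of the toy loss 0.5 * ||theta - batch||^2 (illustrative only)."""
    return theta - batch

theta0 = np.array([1.0, 0.0])
ref_batches = [np.array([0.0, 0.0])]
main_batch = np.array([2.0, 3.0])
new_theta = profit_step(theta0, grad_fn, ref_batches, main_batch)
```

In this toy run the reference step moves only the first coordinate, so the projected update changes only the second coordinate. Projecting onto the complement of $\Delta$ or of $-\Delta$ gives the same vector, which is why the sketch does not track the sign.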

Theorems & Definitions (7)

  • Theorem 3.1
  • Proof
  • Theorem 3.2
  • Proof
  • Corollary 3.2
  • Theorem 3.3
  • Proof