PROFIT: A Specialized Optimizer for Deep Fine Tuning
Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen
TL;DR
PROFIT addresses catastrophic forgetting by treating fine-tuning as a proximal, temporal multitask optimization problem. It is a modular optimizer wrapper that, starting from a converged baseline, runs a brief reference optimization, computes the displacement from the resulting parameters back to the baseline, projects the new-task gradient orthogonally with respect to that displacement so that the update respects the old equilibrium, and then applies the main update along the projected direction. Theoretical results indicate that PROFIT decreases old-task loss and preserves the stable points of traditional optimizers; failure modes are rare and arise when the loss surface is nearly linear. Empirically, PROFIT improves performance on image classification, VTAB-1K, visual-language modeling, and autonomous driving motion prediction, often outperforming standard fine-tuning baselines while preserving old-task performance. The approach is lightweight, modular, and easy to integrate into existing training pipelines, making it a practical tool for robust fine-tuning.
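The four steps summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions that the summary does not specify: both the reference and main optimizers are plain SGD, the new-task gradient is evaluated at the point reached after the reference steps, the projection is applied unconditionally, and the parameters are restored to the baseline before the main update. The names `profit_like_step`, `project_orthogonal`, `toy_grad`, `lr_ref`, `lr_main`, and `n_ref` are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_orthogonal(g: Vector, delta: Vector, eps: float = 1e-12) -> Vector:
    """Remove from g its component along delta, so the result is orthogonal to delta."""
    denom = dot(delta, delta)
    if denom < eps:  # no displacement, so there is nothing to project against
        return list(g)
    scale = dot(g, delta) / denom
    return [gi - scale * di for gi, di in zip(g, delta)]


def profit_like_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr_ref: float = 0.1,
    lr_main: float = 0.1,
    n_ref: int = 1,
) -> Tuple[Vector, Vector, Vector]:
    """One illustrative update; returns (new parameters, projected gradient, displacement)."""
    baseline = list(theta)  # converged starting point (assumed)

    # 1) Brief reference optimization from the baseline (plain SGD, an assumption).
    for _ in range(n_ref):
        g_ref = grad_fn(theta)
        theta = [t - lr_ref * g for t, g in zip(theta, g_ref)]

    # 2) Displacement pointing from the explored point back to the baseline.
    delta = [b - t for b, t in zip(baseline, theta)]

    # 3) New-task gradient at the explored point, projected orthogonal to delta.
    g = project_orthogonal(grad_fn(theta), delta)

    # 4) Restore the baseline and apply the main update along the projected direction.
    new_theta = [b - lr_main * gi for b, gi in zip(baseline, g)]
    return new_theta, g, delta


def toy_grad(theta: Vector) -> Vector:
    """Gradient of a toy anisotropic quadratic 0.5 * sum(a_i * (theta_i - c_i)^2)."""
    a, c = [1.0, 4.0], [1.0, 2.0]
    return [ai * (ti - ci) for ai, ti, ci in zip(a, theta, c)]


new_theta, g, delta = profit_like_step([0.0, 0.0], toy_grad)
print("updated parameters:", new_theta)
print("inner product of projected gradient and displacement:", dot(g, delta))
```

In this toy setup the projected gradient has (numerically) zero inner product with the displacement, which is the property the orthogonal projection is meant to enforce. A real wrapper would operate on the parameter tensors of a model and delegate both inner updates to existing optimizers.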
Abstract
The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning models, there has been less work on fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions because they must work from random initializations, PROFIT explicitly takes the properties of a converged model into account to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms standard fine-tuning methods in various tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, which makes it easy to integrate directly into any training pipeline with minimal engineering effort.
