Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Jun Luo, Yifan Zhu, Tao Feng
TL;DR
This paper tackles the instability of fully replacing First-Order optimization with Zeroth-Order optimization in PEFT-based Vision-Language Continual Learning. It shows that naive full-ZO adoption destabilizes training, and proposes a modality-aware, layer-wise hybrid strategy (MoZO) that interleaves ZO and FO across branches and layers, with gradient-sign normalization and vision perturbation control to balance multi-modal updates. Across CIFAR, TinyImageNet, and ImageNet-R, the approach yields strong performance and memory efficiency, achieving state-of-the-art results on four benchmarks. The work demonstrates that selective ZO deployment, particularly in the language branch and in interleaved layer configurations, can effectively escape local minima while preserving training stability, providing practical guidelines for ZO-FO hybrids in VLCL.
Abstract
Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and Parameter-Efficient Fine-Tuning (PEFT) strategies enable these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first show that naive full-ZO adoption is incompatible with VLCL because it destabilizes the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise across various training units, in order to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart during ZO optimization in VLCL. Based on this, we propose a modality-aware ZO strategy that adopts gradient sign normalization in ZO and constrains the vision modality perturbation to further improve performance. With ZO optimization, PEFT-based VLCL is better able to escape local minima during optimization, and extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
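
To make the idea of a ZO-FO hybrid concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy quadratic loss and a standard two-point random-perturbation ZO estimate reduced to its element-wise sign; the names `toy_loss`, `zo_sign_grad`, `hybrid_step`, and the values of `lr` and `eps` are hypothetical and chosen only for illustration. One parameter block (labelled "language") is updated with the ZO estimate, while the other (labelled "vision") uses an ordinary FO gradient, which is just one possible branch-wise assignment.

```python
import numpy as np

def toy_loss(params, targets):
    """Toy quadratic loss over two parameter blocks (an assumption for illustration)."""
    return sum(0.5 * np.sum((params[k] - targets[k]) ** 2) for k in params)

def zo_sign_grad(loss_fn, theta, eps, rng):
    """Two-point zeroth-order gradient estimate, reduced to its element-wise sign."""
    z = rng.standard_normal(theta.shape)       # random perturbation direction
    delta = loss_fn(theta + eps * z) - loss_fn(theta - eps * z)
    return np.sign(delta / (2.0 * eps) * z)    # entries lie in {-1, 0, +1}

def hybrid_step(params, targets, lr, eps, rng):
    """One hybrid update: ZO on the 'language' block, FO on the 'vision' block."""
    def loss_of_language(lang):
        # Loss as a function of the language block only; needs no backpropagation.
        return toy_loss({"language": lang, "vision": params["vision"]}, targets)

    g_lang = zo_sign_grad(loss_of_language, params["language"], eps, rng)
    g_vis = params["vision"] - targets["vision"]  # analytic FO gradient of the toy loss
    return {
        "language": params["language"] - lr * g_lang,
        "vision": params["vision"] - lr * g_vis,
    }

rng = np.random.default_rng(0)
targets = {"language": np.ones(4), "vision": np.ones(4)}
params = {"language": np.zeros(4), "vision": np.zeros(4)}

initial = toy_loss(params, targets)
sample_sign = zo_sign_grad(lambda t: 0.5 * np.sum((t - 1.0) ** 2), np.zeros(4), 1e-3, rng)
for _ in range(500):
    params = hybrid_step(params, targets, lr=0.01, eps=1e-3, rng=rng)
final = toy_loss(params, targets)
print(f"loss before: {initial:.4f}, loss after: {final:.4f}")
```

In this sketch the sign operation discards the magnitude of the ZO estimate, so every coordinate of the ZO-updated block moves by exactly `lr` per step regardless of how noisy the estimate is. The ZO block only needs forward evaluations of the loss, which is the source of the memory savings that motivates ZO in the first place.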
