Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Jun Luo, Yifan Zhu, Tao Feng

TL;DR

This paper tackles the instability of fully replacing First-Order optimization with Zeroth-Order optimization in PEFT-based Vision-Language Continual Learning. It shows that naive full-ZO adoption destabilizes training, and proposes a modality-aware, layer-wise hybrid strategy (MoZO) that interleaves ZO and FO across branches and layers, with gradient-sign normalization and vision perturbation control to balance multi-modal updates. Across CIFAR, TinyImageNet, and ImageNet-R, the approach yields strong performance and memory efficiency, achieving state-of-the-art results on four benchmarks. The work demonstrates that selective ZO deployment, particularly in the language branch and in interleaved layer configurations, can effectively escape local minima while preserving training stability, providing practical guidelines for ZO-FO hybrids in VLCL.

Abstract

Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies enables these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially within the limited exploration subspace of PEFT. To overcome this challenge, this paper pioneers a systematic exploration of adopting Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL, which stems from instability in the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise, across various training units, to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during ZO optimization, and we propose a modality-aware ZO strategy that adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. By adopting ZO optimization, PEFT-based VLCL is better able to escape local minima during optimization. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
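To make the idea of ZO optimization with gradient sign normalization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-point (SPSA-style) estimator with a random ±1 perturbation, which the abstract does not specify. The names `zo_sign_step`, `quadratic`, `lr`, and `eps` are hypothetical. Under this reading, the vision-perturbation constraint could correspond to passing a smaller `eps` for vision-branch parameters, which is also an assumption.

```python
import random


def zo_sign_step(loss_fn, params, lr=0.01, eps=1e-3, rng=None):
    """One zeroth-order step using only forward evaluations of loss_fn.

    Assumption: two-point SPSA-style estimator with a Rademacher direction.
    """
    rng = rng or random
    # Random perturbation direction with entries in {-1, +1}.
    z = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # Finite-difference estimate of the directional derivative along z.
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)

    def sign(x):
        return (x > 0) - (x < 0)

    # Sign normalization: keep only the sign of each estimated gradient entry,
    # so every coordinate moves by exactly lr (or not at all when the sign is 0).
    return [p - lr * sign(scale * zi) for p, zi in zip(params, z)]


def quadratic(p):
    """Toy stand-in for a training loss (hypothetical)."""
    return sum(x * x for x in p)


rng = random.Random(0)
params = [1.0, -2.0]
initial_loss = quadratic(params)
for _ in range(200):
    params = zo_sign_step(quadratic, params, lr=0.01, eps=1e-3, rng=rng)
final_loss = quadratic(params)
```

Because the update discards the magnitude of the estimate, a noisy high-variance estimate cannot produce an outsized step, which is one plausible reason sign normalization helps balance branches with different variance.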

Paper Structure

This paper contains 13 sections, 5 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Illustration of our study. The language and vision encoders of CLIP are frozen; only the trainable units attached to each layer receive parameter updates. In summary, we systematically explore how ZO optimization operates in VLCL, including branches (Dual, Vision, or Language) and layers (w/ Hop-odd, w/ Hop-even, w/ Prefix (six) and w/ Suffix (six)).
  • Figure 2: How ZO optimization affects loss convergence of VLCL across different branches (CLIP). w/ ZO denotes the branch (Dual, Vision, or Language) where ZO optimization is applied.
  • Figure 3: Analyzing convergence behavior of VLCL in Hop-odd across Dual (Du.), Vision (Vis.), Language (Lan.).
  • Figure 4: How ZO optimization affects gradient variance across layers in VLCL.
  • Figure 5: Analyzing gradient variance of VLCL in Hop-odd across Dual, Vision, Language.
  • ...and 6 more figures
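
The layer configurations named in the Figure 1 caption can be read as sets of layer indices at which ZO replaces FO. The sketch below is one plausible reading only, not the paper's definition. It assumes a 12-layer encoder, 1-based layer indices, Hop-odd/Hop-even meaning every odd/even layer, and Prefix/Suffix (six) meaning the first/last six layers. The function name `zo_layers` and its parameters are hypothetical.

```python
def zo_layers(scheme, num_layers=12, k=6):
    """Return the 1-based layer indices assumed to use ZO under a scheme."""
    layers = list(range(1, num_layers + 1))
    if scheme == "hop-odd":
        # Assumed: ZO on layers 1, 3, 5, ... and FO elsewhere.
        return [i for i in layers if i % 2 == 1]
    if scheme == "hop-even":
        # Assumed: ZO on layers 2, 4, 6, ... and FO elsewhere.
        return [i for i in layers if i % 2 == 0]
    if scheme == "prefix":
        # Assumed: ZO on the first k layers.
        return layers[:k]
    if scheme == "suffix":
        # Assumed: ZO on the last k layers.
        return layers[-k:]
    raise ValueError(f"unknown scheme: {scheme}")


schemes = {name: zo_layers(name) for name in ("hop-odd", "hop-even", "prefix", "suffix")}
```

Layers not returned for a scheme would keep FO updates, which gives the interleaved ZO-FO arrangement described in the TL;DR.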