MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models

Zhen Zhang, Yifan Yang, Kai Zhen, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

TL;DR

MaZO is a masked zeroth-order (ZO) optimization framework for multi-task fine-tuning of large language models. It targets two problems that arise in ZO settings: high gradient variance and conflicts between tasks. By computing a weight importance score and applying a multi-task weight update mask, MaZO concentrates updates on a subset of critical parameters, which reduces the dimensionality of the search space and mitigates inter-task interference. The approach yields state-of-the-art performance on LLaMA-2-7B and Mistral-7B under ZO optimization, including strong results with LoRA-based fine-tuning, and converges faster with modest memory overhead. This parameter-centric strategy offers practical improvements for resource-constrained multi-task LLM fine-tuning and may extend to optimization frameworks beyond ZO.

Abstract

Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
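To make the idea of a masked ZO update concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a standard two-point (SPSA-style) gradient estimate, uses parameter magnitude as a stand-in importance score (the paper's actual metric is not reproduced here), and models the multi-task objective as a plain sum of two toy quadratic task losses. The names `top_fraction_mask`, `masked_zo_step`, and `multi_task_loss` are hypothetical.

```python
import numpy as np

def top_fraction_mask(scores, keep=0.5):
    """Binary mask that keeps the `keep` fraction of entries with the highest score."""
    k = max(1, int(round(keep * scores.size)))
    # k-th largest score; entries at or above it are selected.
    threshold = np.partition(scores.ravel(), -k)[-k]
    return (scores >= threshold).astype(scores.dtype)

def masked_zo_step(params, loss_fn, mask, lr=1e-2, eps=1e-3, rng=None):
    """One two-point ZO step that perturbs and updates only the masked entries."""
    rng = np.random.default_rng() if rng is None else rng
    # Random direction restricted to the selected parameters.
    z = rng.standard_normal(params.shape) * mask
    # Two forward passes give a finite-difference directional derivative;
    # no backpropagation is involved.
    directional = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2.0 * eps)
    return params - lr * directional * z

# Toy setup (assumption): two tasks, each pulling the parameters to its own target.
rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 8))

def multi_task_loss(p):
    return float(sum(0.5 * np.sum((p - t) ** 2) for t in targets))

params0 = rng.standard_normal(8)
# Stand-in importance score (assumption): absolute parameter value.
mask = top_fraction_mask(np.abs(params0), keep=0.5)

params = params0.copy()
initial_loss = multi_task_loss(params)
for _ in range(200):
    params = masked_zo_step(params, multi_task_loss, mask, rng=rng)
final_loss = multi_task_loss(params)
```

In this sketch the entries outside the mask never change, so the random search runs in a 4-dimensional subspace instead of the full 8-dimensional space, which is the dimensionality reduction the abstract refers to.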

Paper Structure

This paper contains 44 sections, 49 equations, 5 figures, 4 tables, 4 algorithms.

Figures (5)

  • Figure 1: Radar chart comparing the performance of our MaZO method with other methods on LLaMA-2-7B and Mistral-7B. Larger is better. Shared model means we train the model on one task and test it on all tasks.
  • Figure 2: Diagram of our MaZO method. The weight importance scores and the weight update mask are calculated row-wise. The weight importance for each task is calculated independently, using only the input and the weights.
  • Figure 3: Top-K eigenvalue distribution of the Hessian matrices in multi-task learning and single-task learning. The eigenvalues are normalized by dividing by the maximum value. The slower decay of eigenvalues in multi-task learning suggests a higher effective rank, which contributes to the slower convergence of ZO fine-tuning in multi-task scenarios.
  • Figure 4: Convergence curves of (1) vanilla multi-task ZO fine-tuning with LoRA and (2) MaZO with LoRA.
  • Figure 5: Loss curves for different LoRA ranks.
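The reasoning in the Figure 3 caption can be illustrated with a small NumPy sketch. It rests on stated assumptions and is not taken from the paper: each task's Hessian is modeled as a random rank-2 positive semidefinite matrix, the multi-task Hessian is the sum of three such matrices, and "effective rank" is taken to be the number of normalized eigenvalues above a small tolerance. The helper names `normalized_top_k` and `effective_rank` are hypothetical.

```python
import numpy as np

def normalized_top_k(hessian, k):
    """Top-k eigenvalues of a symmetric matrix, divided by the largest one."""
    eigvals = np.linalg.eigvalsh(hessian)  # returned in ascending order
    top = eigvals[::-1][:k]                # largest first
    return top / top[0]

def effective_rank(normalized, tol=1e-8):
    """Count of normalized eigenvalues above `tol` (an assumed, simple definition)."""
    return int(np.sum(normalized > tol))

rng = np.random.default_rng(0)
dim, task_rank, num_tasks = 20, 2, 3

# Toy per-task Hessians (assumption): low-rank PSD matrices A @ A.T.
factors = [rng.standard_normal((dim, task_rank)) for _ in range(num_tasks)]
task_hessians = [a @ a.T for a in factors]

single = normalized_top_k(task_hessians[0], k=10)
multi = normalized_top_k(sum(task_hessians), k=10)

single_rank = effective_rank(single)
multi_rank = effective_rank(multi)
```

Because the three toy tasks curve the loss along different random directions, their summed Hessian has more non-negligible eigenvalues than any single task's Hessian, so its normalized spectrum decays more slowly.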