On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization

Prabodh Katti, Sangwoo Park, Bipin Rajendran, Osvaldo Simeone

TL;DR

On-device fine-tuning under a fixed on-chip memory budget is challenging with backpropagation (BP), because BP must store activations and optimizer states. The paper advocates memory-efficient zeroth-order optimization (MeZO), which estimates gradients from forward evaluations alone and so avoids storing activations. The authors provide a theoretical memory analysis showing that MeZO can accommodate substantially larger models than BP, particularly for long context windows, and they validate the analysis with edge-device experiments in which MeZO attains higher accuracy given sufficient wall-clock time. The work suggests MeZO as a practical route for agentic edge AI and continual learning, though further work is needed to tailor it to neuromorphic and highly sparse architectures.

Abstract

On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
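
To make the idea of estimating gradients "using forward evaluations alone" concrete, the following is a minimal NumPy sketch of a two-point (SPSA-style) zeroth-order update of the kind MeZO builds on. It is an illustration under stated assumptions, not the authors' implementation: the function name `mezo_step`, the hyperparameters `lr` and `eps`, and the toy quadratic loss are all hypothetical. The random direction is regenerated from a seed rather than kept between evaluations, which mirrors how such methods avoid extra persistent state; for brevity this sketch materializes the whole direction vector at once, whereas a memory-lean version would perturb parameters piece by piece.

```python
import numpy as np


def mezo_step(theta, loss_fn, lr, eps, seed):
    """Apply one in-place zeroth-order update to `theta` (a float NumPy array).

    Only two forward evaluations of `loss_fn` are used; no activations or
    optimizer states are stored. Returns the scalar projected-gradient estimate.
    """
    def direction():
        # Re-create the same Gaussian direction z from the seed on every call,
        # so z does not have to be kept in memory between uses.
        return np.random.default_rng(seed).standard_normal(theta.shape)

    theta += eps * direction()          # theta + eps * z
    loss_plus = loss_fn(theta)
    theta -= 2.0 * eps * direction()    # theta - eps * z
    loss_minus = loss_fn(theta)
    theta += eps * direction()          # restore the original theta

    # Finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD-style update along the same direction z.
    theta -= lr * projected_grad * direction()
    return projected_grad


# Toy usage (hypothetical): minimize a quadratic with forward evaluations only.
target = np.arange(10, dtype=np.float64)


def quadratic_loss(params):
    return 0.5 * float(np.sum((params - target) ** 2))


params = np.zeros(10)
for step in range(300):
    mezo_step(params, quadratic_loss, lr=0.02, eps=1e-3, seed=step)
```

Each step costs two forward passes and a few passes over the parameters, which is consistent with the trade-off described in the abstract: a smaller memory footprint in exchange for potentially longer wall-clock time.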

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: On-device fine-tuning: (a) backpropagation (BP)-based fine-tuning requires significantly more memory than inference (see Fig. 3), limiting the model size on the device, while (b) MeZO-based fine-tuning (Malladi et al., 2023) carries out only inference steps, enabling deployment of significantly larger, more capable models at the edge.
  • Figure 2: Illustration of a basic decoder-only Transformer model, highlighting the major contributors to parameter count. Non-parametric operations, such as RoPE encoding (doi:10.1016/j.neucom.2023.127063), and the parameter counts of minor contributors, such as normalization layers, are omitted for clarity.
  • Figure 3: Memory requirements for BP- and MeZO-based fine-tuning, including conventional BP (top row) and BP with checkpointing (bottom row). The solid lines (primary y-axis) show the total memory consumed by BP and MeZO, computed from the paper's memory equation, while the green dashed line (secondary y-axis) shows the ratio of MeZO to BP memory requirements.
  • Figure 4: On-device fine-tuning for the BoolQ dataset (Clark et al., 2019) as a function of fine-tuning wall-clock time [s]. We consider MeZO with Llama2-7B and Llama2-13B, while BP uses the GPT2-medium model. The batch size is $B=8$. All models require a similar memory consumption of around 17 GB according to the paper's memory analysis when setting $L'/L=0.15$ for Llama2-13B and $L'/L=0.41$ for Llama2-7B.