FLOPS: Forward Learning with OPtimal Sampling

Tao Ren, Zishi Zhang, Jinyang Jiang, Guanghao Li, Zeliang Zhang, Mingqian Feng, Yijie Peng

TL;DR

This work derives a novel plug-and-play query allocator with minimal parameters, which significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.

Abstract

Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained attention for learning with only forward passes, also referred to as queries. Conventional forward learning consumes an enormous number of queries on each data point to estimate gradients accurately through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve an equal number of queries for gradient estimation. In this paper, we study the problem of improving forward-learning efficiency from a novel perspective: how can the gradient estimation variance be reduced at minimum cost? To this end, we propose to allocate the optimal number of queries to each data point in a batch during training, balancing estimation accuracy against computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. We provide theoretical results that verify its optimality. We conduct extensive experiments on fine-tuning Vision Transformers on various datasets and further deploy the allocator in two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.
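
To make the notion of a per-data-point query budget concrete, the following is a minimal sketch and not the allocator derived in the paper. It assumes a generic Gaussian-perturbation finite-difference estimator on a toy linear model. The names `sample_loss`, `estimate_grad`, and `batch_grad`, the toy data, and the hand-picked budget `[20, 20, 80, 80]` are all hypothetical choices made for illustration.

```python
import random

def sample_loss(theta, x, y):
    """Squared error of a linear model on one data point (toy stand-in for a network)."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    return 0.5 * (pred - y) ** 2

def estimate_grad(theta, x, y, num_queries, sigma, rng):
    """Forward-only gradient estimate for one data point.

    Uses `num_queries` perturbed forward passes plus one unperturbed
    pass that serves as a baseline.
    """
    base = sample_loss(theta, x, y)
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]           # random direction
        perturbed = [t + sigma * e for t, e in zip(theta, eps)]
        scale = (sample_loss(perturbed, x, y) - base) / sigma  # directional slope
        grad = [g + scale * e / num_queries for g, e in zip(grad, eps)]
    return grad

def batch_grad(theta, X, Y, queries_per_sample, sigma=1e-3, seed=0):
    """Average the per-sample estimates; sample i receives queries_per_sample[i] queries."""
    rng = random.Random(seed)
    per_sample = [estimate_grad(theta, x, y, k, sigma, rng)
                  for x, y, k in zip(X, Y, queries_per_sample)]
    return [sum(col) / len(per_sample) for col in zip(*per_sample)]

# Toy batch of four data points with three parameters (hypothetical values).
X = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [2.0, 1.0, 1.0], [0.0, 3.0, -1.0]]
Y = [1.0, 0.0, -1.0, 2.0]
theta = [0.1, -0.2, 0.3]

uniform = batch_grad(theta, X, Y, [50, 50, 50, 50])  # equal allocation
skewed = batch_grad(theta, X, Y, [20, 20, 80, 80])   # same total of 200 queries
```

Both calls spend the same total number of perturbed queries; only the split across data points differs. Choosing that split in a principled way is the problem the allocator addresses.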

Paper Structure

This paper contains 23 sections, 2 theorems, 46 equations, 6 figures, 3 tables, and 1 algorithm.

Key Result

Theorem 1

Suppose that Assumption gaussian_assump holds. Then maximizing the lower bound $LB_t(\bm{\lambda})$ over $\bm{\lambda}\in\Lambda$ is equivalent to minimizing over $\bm{\lambda}\in\Lambda$, i.e.,

Figures (6)

  • Figure 1: Illustration of allocating the query budget in the forward-learning paradigm. As shown in (a), previous methods allocate queries equally across data points. Our method, shown in (b), adaptively allocates queries with theoretically guaranteed optimality.
  • Figure 2: Illustration of the paradigm for fine-tuning the prompt of a black-box vision-language model.
  • Figure 3: Illustration of alignment between foundation models for video understanding.
  • Figure 4: Ablation study on the effect of different allocators and of estimation difficulty at different layers. We show the cosine similarity between the estimated and true gradients. We estimate the gradient of the Key matrix in multi-head attention (MHA) and of the first linear layer in the feed-forward network (FFN). Layer 1 is adjacent to the embedding, and layer 12 is adjacent to the classification head.
  • Figure 5: Wall-clock time for feature alignment across different methods.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 1: Equivalent Objective
  • Theorem 2: Theoretical Improvement
  • proof
  • proof
  • proof