Limited Memory Online Gradient Descent for Kernelized Pairwise Learning with Dynamic Averaging

Hilal AlQuabeh; William de Vazelhes; Bin Gu

TL;DR

The paper addresses the scalability of online pairwise learning by introducing a lightweight kernelized online gradient descent (OGD) algorithm that does not assume i.i.d. data. Each gradient is built from a moving average of past data together with one randomly selected past example, and the kernel is approximated with random Fourier features. The method achieves sublinear regret with complexity $O(T)$ for linear models and $O(\frac{D}{d}T)$ for kernelized variants. Theoretical guarantees cover both the all-pairs regret and the kernel-approximation error, with practical guidance on the feature dimensionality $D$ (down to $\sqrt{T}\log T$ for certain kernels). Experiments on AUC maximization with real-world datasets show better performance and faster convergence than state-of-the-art linear and kernel methods, indicating that the approach scales well and remains robust under non-i.i.d. data.
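To make the random Fourier feature idea concrete, here is a minimal sketch in Python with NumPy. It assumes a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, which the summary above does not specify, and the helper names `sample_rff_params` and `rff_map` are hypothetical. It illustrates the general technique only and is not taken from the paper.

```python
import numpy as np

def sample_rff_params(d, D, sigma, rng):
    """Draw D random frequencies and phases for a Gaussian kernel of width sigma.

    Assumption: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), whose spectral
    density is a Gaussian with standard deviation 1 / sigma per coordinate.
    """
    W = rng.normal(loc=0.0, scale=1.0 / sigma, size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return W, b

def rff_map(X, W, b):
    """Map each row x of X to z(x) = sqrt(2 / D) * cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Compare the exact kernel matrix with the inner products of the features.
rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0
X = rng.normal(size=(20, d))
W, b = sample_rff_params(d, D, sigma, rng)
Z = rff_map(X, W, b)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma**2))
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
print(f"max |K_exact - K_approx| = {max_err:.3f}")
```

Once the data are mapped to the $D$-dimensional features, a linear model on `Z` stands in for the kernel model, so each update costs time proportional to $D$ rather than to the number of stored examples.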

Abstract

Pairwise learning, an important domain within machine learning, addresses loss functions defined on pairs of training examples, including those in metric learning and AUC maximization. Because the computational cost of a pairwise loss grows quadratically with the sample size, researchers have turned to online gradient descent (OGD) methods for better scalability. A recent OGD algorithm computes the gradient from prior and most recent examples, which reduces the algorithmic complexity to $O(T)$, with $T$ being the number of received examples. That approach, however, is confined to linear models and assumes that examples arrive independently. We introduce a lightweight OGD algorithm that does not require the independence of examples and generalizes to kernel pairwise learning. Our algorithm builds the gradient from a random example and a moving average representing the past data, which results in a sub-linear regret bound with a complexity of $O(T)$. Furthermore, by integrating $O(\sqrt{T}\log T)$ random Fourier features, the cost of kernel calculations is substantially reduced. Several experiments with real-world datasets show that the proposed technique outperforms kernel and linear algorithms in offline and online scenarios.
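The update described in the abstract can be illustrated with a rough sketch. The code below is not the paper's algorithm: the pairwise squared loss, the per-class running mean, the size-one reservoir for the random past example, the $\eta/\sqrt{t}$ step size, and the names `pair_grad`, `online_pairwise_ogd`, and `auc` are all assumptions chosen for illustration. It shows only how a gradient can be formed from a moving average plus one random past example while keeping memory independent of $T$.

```python
import numpy as np

def pair_grad(w, x, y, x_other, y_other):
    """Gradient in w of (1 - s * w.(x - x_other))^2 with s = (y - y_other) / 2.

    Labels are assumed to lie in {-1, +1}; same-label pairs contribute nothing.
    """
    if y == y_other:
        return np.zeros_like(w)
    s = (y - y_other) / 2.0
    diff = x - x_other
    return -2.0 * (1.0 - s * (w @ diff)) * s * diff

def online_pairwise_ogd(X, y, eta=0.01, seed=0):
    """One pass of OGD storing two class means and one random past example."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    w = np.zeros(d)
    mean = {1: np.zeros(d), -1: np.zeros(d)}  # running mean per class
    count = {1: 0, -1: 0}
    stored = None  # uniformly random past example (reservoir of size one)
    for t in range(T):
        x_t, y_t = X[t], int(y[t])
        g = np.zeros(d)
        if count[-y_t] > 0:
            # Pair the new example with the mean of the opposite class.
            g += pair_grad(w, x_t, y_t, mean[-y_t], -y_t)
        if stored is not None:
            # Pair the new example with one random past example.
            g += pair_grad(w, x_t, y_t, stored[0], stored[1])
        w -= eta / np.sqrt(t + 1) * g
        count[y_t] += 1
        mean[y_t] += (x_t - mean[y_t]) / count[y_t]
        if rng.random() < 1.0 / (t + 1):
            stored = (x_t.copy(), y_t)
    return w

def auc(scores, y):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    pos, neg = scores[y == 1], scores[y == -1]
    return (pos[:, None] > neg[None, :]).mean()

# Synthetic demo: two classes separated along the first coordinate.
rng = np.random.default_rng(1)
T, d = 2000, 5
y = rng.choice([-1, 1], size=T)
X = rng.normal(size=(T, d))
X[:, 0] += 1.5 * y
w = online_pairwise_ogd(X, y)
train_auc = auc(X @ w, y)
print(f"AUC on the stream: {train_auc:.3f}")
```

Each step touches a constant number of $d$-dimensional vectors, so the total cost over $T$ examples is linear in $T$. Under the same assumptions, applying a fixed feature map to each example before the update gives a kernelized variant of this sketch.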

Paper Structure

This paper contains 17 sections, 11 theorems, 34 equations, 1 figure, 2 tables, 1 algorithm.

Introduction
Problem Setting
Assumptions
Methodology
Regret Analysis
Regret in the Approximated Space $\bar{\mathcal{H}}$
Approximation Error for Random Features
Related Work
Experiments
Experimental Setup
Compared Algorithms
Implementation
Experimental Results and Analysis
Conclusion
Appendix
...and 2 more sections

Key Result

Theorem 1

Let $\{z_t\in \mathcal{Z}\}_{t=1}^T$ be sequentially accessed by Algorithm 1 (FPOGD). Let $D$ denote the number of random Fourier features in the kernel mapping from the original space $\mathcal{X} \subset \mathbb{R}^d$. Let $\eta$ be the first step size and $\gamma=O(\Gamma M_t \eta)$ the second step size, where $\epsilon$ denotes the kernel approximation error, $\sigma$ signifies the kernel width, $\Ga

Figures (1)

Figure 1: The AUC vs. time comparison of the algorithms in different datasets showing superior performance of the proposed method.

Theorems & Definitions (17)

Theorem 1
Remark 1
Lemma 1
Theorem 2
Corollary 1
Theorem 3
Remark 2
Lemma 2
Proof
Lemma 3
...and 7 more
