An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

Yeqi Gao; Zhao Song; Junze Yin

TL;DR

This work introduces a rescaled softmax regression problem motivated by attention in large language models, where the function $u(x)$ can be $\exp(Ax)$, $\cosh(Ax)$, or $\sinh(Ax)$ and the objective is to minimize $L_u(x) = \tfrac{1}{2}\|u(x) - \langle u(x), \mathbf{1}_n\rangle b\|_2^2$. It develops a randomized, subquadratic algorithm based on an approximate Newton method that exploits a positive definite, Lipschitz Hessian to solve this regression with near-linear dependence on the input sparsity and near-optimal matrix-multiplication cost. The analysis establishes these Hessian properties, introduces an $(l,M)$-good loss notion to prove convergence, and extends to a regularized variant and to in-context learning via Lipschitz bounds with respect to the data matrix $A$. The results suggest potential speedups for attention computations in transformer models with minimal performance loss, and the approach covers all three choices of $u$, offering a unified, scalable optimization route for attention-related regression tasks.
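To make the objective concrete, the following is a minimal NumPy sketch of $L_u(x)$ and its gradient for the three choices of $u$. The function names `rescaled_loss` and `rescaled_grad` are hypothetical, and the gradient formula is derived here by the chain rule rather than taken from the paper; the sketch evaluates the exact loss and does not implement the paper's approximate Newton algorithm.

```python
import numpy as np

# Each entry maps a choice of u to (u, u'), applied elementwise to A @ x.
_FUNCS = {
    "exp": (np.exp, np.exp),
    "cosh": (np.cosh, np.sinh),
    "sinh": (np.sinh, np.cosh),
}


def rescaled_loss(A, b, x, kind="exp"):
    """Return L_u(x) = 0.5 * || u(x) - <u(x), 1_n> b ||_2^2."""
    u_fn, _ = _FUNCS[kind]
    u = u_fn(A @ x)
    r = u - u.sum() * b  # residual; u.sum() is <u(x), 1_n>
    return 0.5 * float(r @ r)


def rescaled_grad(A, b, x, kind="exp"):
    """Gradient of L_u with respect to x (chain-rule derivation, an assumption).

    With v = u'(Ax) and r the residual, the Jacobian of r is
    (I - b 1^T) diag(v) A, so the gradient is A^T (v * (r - <b, r>)).
    """
    u_fn, du_fn = _FUNCS[kind]
    z = A @ x
    u, v = u_fn(z), du_fn(z)
    r = u - u.sum() * b
    return A.T @ (v * (r - b @ r))
```

With $u = \exp$, choosing $b$ as the normalized vector $\exp(Ax)/\langle \exp(Ax), \mathbf{1}_n\rangle$ drives the loss to zero at that $x$, which gives a quick sanity check for the sketch.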

Abstract

Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. The exponential-function-based attention unit is a fundamental element of LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of a slightly different formulation of the softmax regression problem that arises in the attention mechanisms of large language models. Specifically, we consider minimizing the squared loss between a certain function, which can be the exponential, hyperbolic sine, or hyperbolic cosine function, and its inner product with a target $n$-dimensional vector $b$, scaled by the normalization term. This "rescaled softmax regression" differs from classical softmax regression in the location of the normalization factor. The efficiency of this framework and its generality across multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also yields a corollary that bounds how much the solution changes under small perturbations, which is relevant to in-context learning. Limitations and societal impact are discussed.
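For orientation, the difference in where the normalization factor sits can be written out as follows, taking $u(x) = \exp(Ax)$. The classical form shown here is the usual softmax regression objective and is stated as an assumption about the prior formulation the abstract refers to; the rescaled form matches the objective $L_u$ given in the TL;DR.

```latex
\begin{align*}
\text{classical:}\quad & \min_{x \in \mathbb{R}^d} \big\| \langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b \big\|_2^2, \\
\text{rescaled:}\quad  & \min_{x \in \mathbb{R}^d} \tfrac{1}{2} \big\| \exp(Ax) - \langle \exp(Ax), \mathbf{1}_n \rangle \, b \big\|_2^2 .
\end{align*}
```

In the classical form the normalization divides the function value; in the rescaled form it multiplies the target vector $b$ instead.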

Paper Structure (69 sections, 33 theorems, 128 equations, 1 algorithm)


Introduction
Our contributions.
Our Results
Roadmap.
Related Work
Optimization and Convergence
Learning in-context
Fast Attention Computation
Preliminaries
Notation
General Functions: Definitions
A Basic Mathematical Property
Technique Overview
General Functions
$\frac{\mathrm{d}^2 L}{\mathrm{d} x^2}$ is Positive Definite
...and 54 more sections

Key Result

Theorem 1.3

Let $\epsilon, \delta \in (0, 0.1)$ be the accuracy parameter and the failure probability, respectively. Let $x_0, x^* \in \mathbb{R}^d$ denote the initial point and the optimal solution, respectively, let $\mathrm{nnz}(A)$ denote the number of non-zero entries of $A$, and let $\omega \approx$ […] time in each iteration, and outputs a vector $\widetilde{x} \in \mathbb{R}^d$ such that […]

Theorems & Definitions (72)

Definition 1.1: $\ell$-th layer forward computation and attention optimization
Definition 1.2: Rescaled Softmax Regression
Theorem 1.3: Main Result (informal version of the formal main theorem)
Definition 3.1
Definition 3.2
Definition 3.3: Loss function $L_u$
Definition 3.4: Rescaled coefficients
Definition 3.5
Definition 3.6
Lemma 5.1: Informal version of the formal lemma
...and 62 more
