An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
Yeqi Gao, Zhao Song, Junze Yin
TL;DR
This work introduces a rescaled softmax regression problem motivated by attention in large language models, where the function $u(x)$ can be $\exp(Ax)$, $\cosh(Ax)$, or $\sinh(Ax)$ and the objective is to minimize $L_u(x) = \tfrac{1}{2}\|u(x) - \langle u(x), \mathbf{1}_n\rangle b\|_2^2$. It develops a randomized, subquadratic algorithm based on an approximate Newton method, which relies on the Hessian being positive definite and Lipschitz, and solves this regression with near-linear dependence on the input sparsity and near-optimal matrix-multiplication cost. The analysis establishes these Hessian properties, introduces an $(l,M)$-good loss notion used to prove convergence, and extends to a regularized variant and to in-context learning via Lipschitz bounds with respect to the data matrix $A$. The results suggest potential speedups for attention computations in transformer models with minimal performance loss. Because the approach covers the exponential, hyperbolic cosine, and hyperbolic sine functions, it offers a unified, scalable optimization route for attention-related regression tasks.
Abstract
Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come. The exponential-function-based attention unit is a fundamental element of LLMs. Several previous works have studied the convergence of exponential regression and softmax regression. In this paper, we propose an iterative algorithm to solve a rescaled version of a slightly different formulation of the softmax regression problem that arises in the attention mechanisms of large language models. Specifically, we consider minimizing the squared loss between a certain function $u(x)$, which can be either the exponential function, hyperbolic sine function, or hyperbolic cosine function, and a target $n$-dimensional vector $b$ scaled by the normalization term $\langle u(x), \mathbf{1}_n\rangle$. This "rescaled softmax regression" differs from classical softmax regression in the location of the normalization factor: the normalization term multiplies the target $b$ instead of dividing $u(x)$. The efficiency of this framework and its applicability to multiple hyperbolic functions make it relevant for optimizing attention mechanisms. The analysis also leads to a corollary that bounds how much the solution changes under small perturbations, with application to in-context learning. Limitations and societal impact are discussed.
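To make the location of the normalization factor concrete, the following is a minimal NumPy sketch, not the authors' implementation. It evaluates the rescaled objective $L_u(x) = \tfrac{1}{2}\|u(x) - \langle u(x), \mathbf{1}_n\rangle b\|_2^2$ for the three choices of $u$, alongside a classical softmax regression loss in which the normalization divides $\exp(Ax)$. The helper names (`rescaled_loss`, `softmax_loss`, `U_FUNCS`), the factor $\tfrac{1}{2}$ on the classical loss, and the random test data are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical helper names; only the formula for L_u(x) comes from the text.
U_FUNCS = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}


def rescaled_loss(A, b, x, kind="exp"):
    """Rescaled loss: 0.5 * || u(x) - <u(x), 1_n> b ||_2^2, with u(x) = kind(Ax)."""
    u = U_FUNCS[kind](A @ x)
    residual = u - u.sum() * b  # the normalization term multiplies the target b
    return 0.5 * float(residual @ residual)


def softmax_loss(A, b, x):
    """Classical form (assumed here): 0.5 * || exp(Ax) / <exp(Ax), 1_n> - b ||_2^2."""
    u = np.exp(A @ x)
    residual = u / u.sum() - b  # the normalization term divides u(x)
    return 0.5 * float(residual @ residual)


# Small random instance (illustrative data, not from the paper).
rng = np.random.default_rng(0)
n, d = 6, 3
A = rng.normal(size=(n, d))
x = rng.normal(size=d)
b = rng.random(n)
b /= b.sum()  # make b a probability vector, as a softmax target would be

alpha = float(np.exp(A @ x).sum())  # the normalization term <exp(Ax), 1_n>

for kind in U_FUNCS:
    print(kind, rescaled_loss(A, b, x, kind))
print("softmax", softmax_loss(A, b, x))
```

For $u(x) = \exp(Ax)$ the two residuals differ only by the scalar $\langle u(x), \mathbf{1}_n\rangle$, so at any fixed $x$ the rescaled loss equals the classical loss multiplied by $\langle u(x), \mathbf{1}_n\rangle^2$. That factor depends on $x$, so the two objectives are still different functions to optimize.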
