Optimized Speculative Sampling for GPU Hardware Accelerators

Dominik Wagner; Seanie Lee; Ilja Baumann; Philipp Seeberger; Korbinian Riedhammer; Tobias Bocklet

TL;DR

To accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid, which yields relative improvements in profiling time ranging from 37% to 94% with a minor decline in accuracy.

Abstract

In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
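For readers unfamiliar with the underlying algorithm, the following is a minimal sketch of the accept/reject decision in speculative sampling as it is commonly described in the literature, not the optimized GPU kernel from this paper. The function name `speculative_step` and the variable names `tau`, `a`, and `b` are illustrative assumptions that loosely mirror the symbols in the figure captions; their exact definitions in the paper may differ.

```python
import random


def speculative_step(p, q, draft_token, rng):
    """One accept/reject decision for a single drafted token.

    p: target-model probabilities over the vocabulary (list of floats)
    q: draft-model probabilities over the vocabulary (list of floats)
    draft_token: index sampled from q, so q[draft_token] > 0
    rng: a random.Random instance
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p / q).
    tau = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < tau:
        return draft_token, True

    # On rejection, resample from the normalized positive part of p - q.
    # These element-wise terms are the kind of intermediate result that
    # can be computed independently for each vocabulary entry.
    a = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    b = sum(a)  # reduction over the vocabulary
    weights = [ai / b for ai in a]
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False
```

With `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.3, 0.5]`, a drafted token `0` is always accepted, while a drafted token `2` is accepted with probability 0.4 and otherwise replaced by token `0`. Averaged over tokens drafted from `q`, the output follows `p`.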

Paper Structure (43 sections, 7 equations, 5 figures, 8 tables)


Introduction
Related work
Method
Preliminaries
Speculative sampling.
GPU memory and execution model.
Acceleration of speculative sampling
Exact optimization
Approximated optimization
Bottleneck of softmax.
Sigmoid approximation.
Experiments
Experimental setup
Datasets and metrics.
Hyperparameters.
...and 28 more sections

Figures (5)

Figure 1: Overview of our exact optimization approach. We compute most of the results required for speculative sampling in parallel using fast SRAM to read and write intermediate results. We maximize the number of threads per block to run parallel computation on as many elements as possible without exhausting the available on-chip memory.
Figure 2: Overview of the computations within each thread block for sigmoid approximation. Each set of logits is scaled by a minimum constant $\alpha$ and a maximum constant $\beta$. Sigmoid activations $\sigma$ are then computed and stored in SRAM for each segment of draft and target logits. Subsequently, the intermediate values $\hat{f}_k(x)$, $\hat{a}_k(x)$, $\hat{b}_k$, and $\hat{\tau}_{c_k}(x)$ are computed analogously to Figure 1. The resulting outputs are then used to update $\hat{\tau}_c(x)$, $\hat{a}(x)$, and $\hat{b}$ in HBM.
Figure 3: Average execution time of the speculative sampling algorithm per decoding step for varying initial $\gamma$ values on randomly sampled subsets (10%) of Xsum and CV16 test sets.
Figure 4: Peak memory usage (HBM) on randomly sampled 10% of the Xsum test set for varying initial values of $\gamma$.
Figure 5: Peak memory usage (HBM) on randomly sampled 10% of the CV16 test set for varying initial values of $\gamma$.
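The contrast between softmax and the sigmoid approximation described in the Figure 2 caption can be illustrated with a small sketch. It is written in plain Python rather than as a GPU kernel, and the scaling form `(z - alpha) / (beta - alpha)` is an assumption: the caption only states that logits are scaled by a minimum constant $\alpha$ and a maximum constant $\beta$. The helper names `scaled_sigmoid` and `segmentwise_sigmoid` are hypothetical.

```python
import math


def softmax(logits):
    # Softmax needs two reductions over the whole vocabulary
    # (max and sum) before any output value is final.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def scaled_sigmoid(logits, alpha, beta):
    # Assumed scaling: map logits with constants alpha < beta, then
    # apply an element-wise sigmoid. No cross-element reduction is needed.
    return [1.0 / (1.0 + math.exp(-(z - alpha) / (beta - alpha)))
            for z in logits]


def segmentwise_sigmoid(logits, alpha, beta, segment_size):
    # Process the vocabulary in independent segments, loosely analogous
    # to one segment per thread block. Each segment uses only its own data.
    out = []
    for start in range(0, len(logits), segment_size):
        segment = logits[start:start + segment_size]
        out.extend(scaled_sigmoid(segment, alpha, beta))
    return out
```

Because the sigmoid is element-wise, processing the logits segment by segment gives the same values as processing them all at once. The outputs are not normalized across the vocabulary, so they only approximate a probability distribution.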
