Scaling Embedding Layers in Language Models

Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang

TL;DR

SCONE expands the input embedding layer without adding accelerator load: a separate transformer ($\mathcal{A}_{\mathrm{f\text{-}gram}}$) learns contextualized $n$-gram embeddings during training, and its outputs are cached for inference via $\mathcal{F}$. Because the $n$-gram embeddings are decoupled from the token vocabulary, SCONE opens two new scaling axes (more $f$-grams and larger $\mathcal{A}_{\mathrm{f\text{-}gram}}$ models) without increasing inference-time FLOPS or accelerator memory. Empirical results on GPT-2–scale pretraining show perplexity improvements and strong zero-shot gains on downstream tasks, with sizable reductions in inference cost relative to larger baselines. The approach suits latency-sensitive deployments: the heavy embedding learning happens at training time, the cached embeddings live in off-accelerator storage, and the inference footprint on the accelerator stays fixed.

Abstract

We propose $SCONE$ ($S$calable, $C$ontextualized, $O$ffloaded, $N$-gram $E$mbedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, $SCONE$ retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide a contextualized representation for each input token and are learned with a separate model during training. After training, the embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency because embedding lookups are cheap. $SCONE$ enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while keeping accelerator usage during inference fixed (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
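The inference-time lookup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `lookup_input_embeddings`, `token_table`, and `fgram_table` are hypothetical, the cache is modeled as an in-memory dictionary rather than off-accelerator storage, and two behaviors are assumptions not stated in the abstract: preferring the longest matching n-gram ending at each position, and falling back to the ordinary token embedding when nothing matches.

```python
from typing import Dict, List, Sequence, Tuple

Embedding = List[float]


def lookup_input_embeddings(
    tokens: Sequence[int],
    token_table: Dict[int, Embedding],
    fgram_table: Dict[Tuple[int, ...], Embedding],
    max_len: int = 3,
) -> List[Embedding]:
    """Return one input embedding per token position.

    For position i, try the n-grams ending at i from longest to shortest
    (assumption: longest match wins). If none is cached, fall back to the
    plain token embedding (assumption).
    """
    result: List[Embedding] = []
    for i in range(len(tokens)):
        chosen = token_table[tokens[i]]
        # n counts tokens in the candidate n-gram; n >= 2 because a
        # single token is already covered by token_table.
        for n in range(min(max_len, i + 1), 1, -1):
            key = tuple(tokens[i - n + 1 : i + 1])
            if key in fgram_table:
                chosen = fgram_table[key]
                break
        result.append(chosen)
    return result


if __name__ == "__main__":
    # Toy tables with 1-dimensional embeddings (hypothetical values).
    token_table = {1: [1.0], 2: [2.0], 3: [3.0]}
    fgram_table = {(1, 2): [12.0], (1, 2, 3): [123.0]}
    print(lookup_input_embeddings([1, 2, 3, 2], token_table, fgram_table))
```

Each position costs at most `max_len - 1` dictionary probes and no matrix multiplication, which is consistent with the claim that querying cached embeddings has minimal impact on latency.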

Paper Structure

This paper contains 35 sections, 13 figures, 7 tables, and 4 algorithms.

Figures (13)

  • Figure 1: Illustration of SCONE with a maximum $n$-gram length of 3. The f-grams are a set of frequent $n$-grams selected using the method described in \ref{subsec:key_discovery}.
  • Figure 2: (Left) Perplexity on the OLMo (Groeneveld et al., 2024) evaluation set. Model sizes along the $x$-axis indicate the number of parameters residing on the accelerator during inference. With 10M f-grams, the 1.3B model matches the performance of the 1.9B baseline; with 1B f-grams, the 1B model surpasses it. (Right) End-to-end token generation speed on a single A100 GPU. Storing f-gram embeddings in main memory adds negligible latency, while using NVMe storage introduces a minor slowdown without causing a bottleneck.
  • Figure 3: Number of unique $2$- to $6$-grams appearing at least five times. We uniformly sample tokenized sequences from Dolma (Soldaini et al., 2024) to vary the corpus size.
  • Figure 4: Evaluation perplexity on WebText (left) and WikiText-103 (right) as a function of $|V_{\mathrm{f\text{-}gram}}|$. Model sizes in the legend are number of parameters residing on the accelerator during inference. Dashed lines and leftmost stars show baseline performance.
  • Figure 5: Effect of the maximum f-gram length $n$ on perplexity and matched length. Perplexity decreases as the maximum length increases from 2 to 4, then plateaus with minor fluctuations. Similarly, the average matched length stabilizes after length 4.
  • ...and 8 more figures
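
The quantity plotted in Figure 3 (unique $2$- to $6$-grams appearing at least five times) can be computed along the following lines. This is a hedged sketch only: the function name `count_frequent_ngrams` and its parameters are hypothetical, and it counts in memory over a small list of tokenized sequences, whereas a corpus the size of Dolma would need a sharded or streaming approach.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_frequent_ngrams(
    sequences: Iterable[Sequence[int]],
    min_len: int = 2,
    max_len: int = 6,
    min_count: int = 5,
) -> Dict[int, int]:
    """Map each n-gram length to the number of distinct n-grams of that
    length that occur at least `min_count` times across all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        for n in range(min_len, max_len + 1):
            # Slide a window of width n over the sequence.
            for start in range(len(seq) - n + 1):
                gram: Tuple[int, ...] = tuple(seq[start : start + n])
                counts[gram] += 1

    unique_by_len = {n: 0 for n in range(min_len, max_len + 1)}
    for gram, count in counts.items():
        if count >= min_count:
            unique_by_len[len(gram)] += 1
    return unique_by_len


if __name__ == "__main__":
    # Five copies of a toy sequence, so every n-gram in it occurs 5 times.
    toy_corpus = [[1, 2, 3]] * 5
    print(count_frequent_ngrams(toy_corpus))
```

The n-grams that pass the frequency threshold are natural candidates for the f-gram set, although the selection method used in the paper may differ in its details.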