LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, Patrick Schramowski, Björn Deiseroth

TL;DR

Decoder-only LLM pre-training is highly data-intensive. LIME enriches token representations with linguistic metadata embeddings carrying POS and NER signals, improving data efficiency and modeling performance with negligible parameter overhead. The method is a four-stage pipeline: linguistic pre-tokenization, metadata annotation, granularity alignment, and metadata embedding. A look-ahead variant, LIME+1, additionally guides generation and improves reasoning and arithmetic. Results hold across model scales from 500M to 2B parameters: up to 56% faster adaptation to the training distribution, better next-token accuracy and perplexity, and gains on generative benchmarks, with LIME+1 improving reasoning by up to 38% and arithmetic accuracy by up to 35%. The work presents metadata as a practical, tokenizer-agnostic signal that improves efficiency, token cohesion, and controllability, and suggests future extensions to multilingual settings and richer anticipatory metadata.

Abstract

Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
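The two ideas in the abstract, adding metadata embeddings to token embeddings and shifting the metadata by one position for LIME+1, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the fusion is an element-wise sum and that LIME+1 gives position i the tag of token i+1; the summary above states neither detail. All names (`make_table`, `shift_metadata`, `embed`, `PAD_TAG`), the table sizes, and the ids are hypothetical.

```python
import random

def make_table(num_rows, dim, seed):
    """Build a small random embedding table as a list of row vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def shift_metadata(tag_ids, pad_id):
    """Look-ahead shift (assumed reading of LIME+1): position i gets the tag of token i+1."""
    return tag_ids[1:] + [pad_id]

def embed(token_ids, tag_ids, token_table, tag_table):
    """Fuse token and metadata embeddings by element-wise addition (an assumption)."""
    return [
        [t + m for t, m in zip(token_table[tok], tag_table[tag])]
        for tok, tag in zip(token_ids, tag_ids)
    ]

DIM = 4
PAD_TAG = 0  # hypothetical id meaning "no metadata available"
token_table = make_table(num_rows=10, dim=DIM, seed=0)  # toy subword vocabulary
tag_table = make_table(num_rows=4, dim=DIM, seed=1)     # toy tag vocabulary

token_ids = [5, 2, 7]
pos_ids = [1, 3, 2]  # one tag id per token, already aligned to the subwords

# LIME-style input: each token is paired with its own tag.
lime_inputs = embed(token_ids, pos_ids, token_table, tag_table)

# LIME+1-style input: each token is paired with the tag of the next token.
lime_plus1_inputs = embed(
    token_ids, shift_metadata(pos_ids, PAD_TAG), token_table, tag_table
)
```

Because a tag vocabulary is far smaller than a subword vocabulary, the extra table adds few parameters, which matches the small overhead the abstract reports.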

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: LIME and LIME$^{\texttt{+1}}$ architecture. (1) Input text $S$ is split by the linguistic tokenizer ($T_{li}$). (2) Linguistic splits are annotated, e.g. with POS and NER tags. (3) Subword tokenization ($T_{sw}$) is applied to the linguistic tokens and annotations are aligned to the new splits. (4) Tokens and metadata are embedded, fused together, and passed into consecutive transformer blocks.
  • Figure 2: Left: Next-token accuracy improves with metadata embedding layers. Our LIME$_{\texttt{500M}}$ model requires 56% less pre-training data to achieve the same token prediction accuracy as Baseline. Right: Accuracy and perplexity improvements translate consistently across model sizes.
  • Figure 2: LIME excels at generative tasks. Improvements over Base are indicated by $\uparrow$, improvements over LIME by $\Uparrow$. Generative-format tasks are highlighted in yellow. Results are shown for the 500M model; other model sizes are reported in the appendix.
  • Figure 3: LIME$_{\texttt{500M}}$ token prediction accuracy increases within semantic and syntactic metadata class boundaries: among the 100 most impactful (share $\times$ $\Delta$) tokens, our model exhibits improved token coupling for suffix tokens, digit tokens, and an exemplary entity-group token (␣States).
  • Figure 4: Inference with LIME and LIME$^{\texttt{+1}}$ models.
  • ...and 3 more figures
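
Steps (1) to (3) of the Figure 1 caption (linguistic split, annotation, and alignment of annotations to subword splits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline. The whitespace split stands in for the linguistic tokenizer, `toy_subword_split` is a hypothetical fixed-length chunker standing in for a real subword tokenizer, the POS tags are hand-written rather than produced by a tagger, and copying a word's tag onto each of its pieces is an assumed alignment rule.

```python
def linguistic_split(text):
    """Stand-in for the linguistic tokenizer: a plain whitespace split."""
    return text.split()

def toy_subword_split(word, max_len=4):
    """Hypothetical subword tokenizer: fixed-size character chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def align(words, tags):
    """Copy each word-level tag onto every subword piece of that word."""
    pieces, piece_tags = [], []
    for word, tag in zip(words, tags):
        for piece in toy_subword_split(word):
            pieces.append(piece)
            piece_tags.append(tag)
    return pieces, piece_tags

words = linguistic_split("United States elections")
pos_tags = ["PROPN", "PROPN", "NOUN"]  # hand-written example tags

pieces, piece_tags = align(words, pos_tags)
for piece, tag in zip(pieces, piece_tags):
    print(f"{piece}\t{tag}")
```

Splitting on linguistic boundaries first means no subword piece spans two annotated units, so every piece inherits exactly one tag per metadata type.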