URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

Dongyang Fan, Vinko Sabolčec, Martin Jaggi

TL;DR

The paper investigates whether contextual metadata can improve LLM pretraining and generation. It introduces context-conditioned pretraining, in which explicit context tokens are prepended to 90% of the training texts while the remaining 10% stay context-free, and it compares three generation modes: context-free, context-conditioned, and context-guided sampling. The key finding is that only URL metadata accelerates pretraining, and its downstream gains appear only with longer prompts. Quality scores and topic/format domain information do not speed up training, although topic and format metadata do enable steerable outputs. Context-aware pretraining also improves controllability through classifier-free guidance, which suggests a dual role for metadata: training efficiency (URL) and inference-time controllability (topic/format). The work highlights practical implications for building more efficient, steerable LLMs and points to scaling and broader metadata as future directions.

Abstract

Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
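
The abstract describes context-guided generation as working "in a classifier-free guidance fashion" but does not state a formula here. The sketch below shows one common form of classifier-free guidance on next-token distributions, as an assumption rather than the authors' exact method: the context-conditioned and context-free log-probabilities are combined with a guidance weight. The names `log_softmax`, `guided_log_probs`, and `gamma` are illustrative and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_log_probs(cond_logits, free_logits, gamma):
    """Combine two next-token predictions (assumed guidance form).

    cond_logits: logits from a forward pass with the context prepended.
    free_logits: logits from a forward pass with an empty context.
    gamma: guidance weight (hypothetical name); 0 reproduces the
           context-free distribution, 1 the context-conditioned one,
           and values above 1 push further toward the context.
    """
    cond = log_softmax(cond_logits)
    free = log_softmax(free_logits)
    mixed = [f + gamma * (c - f) for c, f in zip(cond, free)]
    # Renormalize so the result is a valid log-distribution.
    return log_softmax(mixed)
```

With this form, a single model trained on both context-prepended and context-free texts can supply both forward passes, which is why the mixed training data matters for guidance at inference time.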

Paper Structure

This paper contains 24 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: An example of our context-aware tokenization. Each document begins with a default beginning-of-sequence (<s>) token. For each sequence, a context segment wrapped in beginning-of-context (<boc>) and end-of-context (<eoc>) tokens is inserted after <s> and before the main text. If a document is too long and is split into multiple sequences, the context is prepended to each one. Although a context segment is added to every sequence, it may be empty: in 90% of the corpus the context is non-empty, and in the remaining 10% it is left empty. The context can be a URL, a quality score, or a topic/format domain, depending on the user's choice.
  • Figure 2: Diagram of our two-stage investigation. During pretraining, we feed the model a uniform mixture of 90% context-prepended texts and 10% context-free standard texts. At inference time, we compare three different generation sampling methods.
  • Figure 3: Training perplexity versus the amount of consumed tokens. Prepending the URL leads to a faster decrease in perplexity.
  • Figure 4: URL-conditioned pretraining matches the downstream evaluation performance of 100B-token standard pretraining with only 60B tokens. The same plots for all tasks are provided in Figure \ref{fig:eval-speed-up-02}.
  • Figure 5: Average attention to different parts of different prepended contexts. For example, in https://en.wikipedia.org/wiki/Metadata#Standards, en.wikipedia.org is the URL domain and /wiki/Metadata#Standards is the URL suffix. More details are given in Appendix \ref{app:attention-pattern}.
  • ...and 3 more figures
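
The tokenization scheme in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: tokens are plain strings, the choice between a non-empty and an empty context is made once per document, every split sequence starts with <s>, and the context prefix counts toward the sequence length. The function `build_sequences` and its parameters are hypothetical names.

```python
import random

BOS, BOC, EOC = "<s>", "<boc>", "<eoc>"


def build_sequences(doc_tokens, context_tokens, max_len, rng, keep_prob=0.9):
    """Split one document into sequences, each with a context segment.

    doc_tokens: tokens of the main text.
    context_tokens: tokens of the metadata (URL, quality score, or
        topic/format domain).
    max_len: maximum sequence length, prefix included (assumption).
    rng: a random.Random instance, passed in for reproducibility.
    keep_prob: probability of keeping a non-empty context; 0.9 follows
        the 90/10 split described in the caption.
    """
    # Assumption: the keep/drop choice is made once per document.
    ctx = list(context_tokens) if rng.random() < keep_prob else []
    # The context segment is always present, but may be empty.
    prefix = [BOS, BOC, *ctx, EOC]
    body_len = max_len - len(prefix)
    if body_len <= 0:
        raise ValueError("max_len is too small for the context prefix")
    chunks = [doc_tokens[i:i + body_len]
              for i in range(0, len(doc_tokens), body_len)]
    # The same prefix is prepended to every sequence of the document.
    return [prefix + chunk for chunk in chunks]
```

Because the context is described as auxiliary input that is not used in the loss calculation, a training loop built on this sketch would also need to mask the prefix positions when computing the loss.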