A Single Character can Make or Break Your LLM Evals

Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim

TL;DR

The paper shows that the single-character delimiter used to separate in-context demonstrations can dramatically alter LLM evaluation outcomes, both for open-source families (Llama, Gemma, Qwen) and for closed models (GPT-4o). It establishes a common evaluation protocol and systematically varies 30 ASCII delimiters across multiple benchmarks (MMLU, ARC-Challenge, CommonsenseQA), showing performance swings of up to $29.4\%$ on MMLU and substantial ranking shifts. The authors demonstrate that specifying the delimiter in the prompt improves robustness (with gains of up to $27.9\%$ on some tasks) and that certain delimiters steer attention toward relevant input tokens, revealing a mechanistic link between prompt formatting and inference. They offer practical recommendations (e.g., newline or exclamation-mark delimiters) and call for broader studies of formatting brittleness to make benchmarking and real-world prompting more reliable.

Abstract

Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma? newline? semicolon? hashtag? etc.? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find that LLMs' brittleness pervades topics and model families, and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
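To make the setup concrete, the following is a minimal sketch of how a few-shot prompt might be assembled while varying only the example delimiter. The `Question: ... Answer: ...` template, the function name `build_few_shot_prompt`, and the wording of the sentence that announces the delimiter are illustrative assumptions; the text above does not give the authors' exact prompt format.

```python
from typing import Sequence, Tuple

# A demonstration is a (question, answer) pair. This template is an
# assumption for illustration, not the paper's actual prompt format.
Demo = Tuple[str, str]


def build_few_shot_prompt(
    demos: Sequence[Demo],
    query: str,
    delimiter: str = "\n",
    announce_delimiter: bool = False,
) -> str:
    """Join in-context examples and the final query with one delimiter."""
    blocks = [f"Question: {q} Answer: {a}" for q, a in demos]
    # The query is left unanswered so the model completes it.
    blocks.append(f"Question: {query} Answer:")
    body = delimiter.join(blocks)
    if announce_delimiter:
        # Hypothetical wording for "specifying the delimiter in the prompt".
        header = f"Examples are separated by the character {delimiter!r}."
        return header + "\n" + body
    return body


demos = [("2+2?", "4"), ("Capital of France?", "Paris")]

# Same content, different single-character separators.
prompt_newline = build_few_shot_prompt(demos, "3+3?", delimiter="\n")
prompt_bang = build_few_shot_prompt(demos, "3+3?", delimiter="!")
prompt_announced = build_few_shot_prompt(
    demos, "3+3?", delimiter="!", announce_delimiter=True
)
```

Only the `delimiter` argument differs between `prompt_newline` and `prompt_bang`, which is the kind of controlled comparison the abstract describes: any change in model accuracy between the two prompts can be attributed to the separator alone.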

Paper Structure

This paper contains 35 sections, 7 figures, 37 tables.

Figures (7)

  • Figure 1: One can manipulate rankings to put any model in the lead by varying the single delimiter character. On the left, we show the delimiter used to separate examples in common evals with few-shot examples, such as MMLU. On the right, we show model rankings based on MMLU performance as the example delimiter varies, with each column corresponding to a different ranking.
  • Figure 2: Changing a single delimiter character can dramatically change performance across model families. We show model performance across the Llama, Qwen, and Gemma families on MMLU, ARC-Challenge, and CommonsenseQA as we vary only the example delimiter (shown above each bar in blue).
  • Figure 3: The choice of delimiter affects performance across a range of topics. We show the accuracy by topic on MMLU for three model families, with the delimiter shown above each bar in blue.
  • Figure 4: Larger models are just as brittle to changes in the delimiter. We compare the performance of Llama-3.1-instruct at two sizes, 8B and 70B, as the delimiter varies (shown above each bar in blue). Although scale improves overall performance across all three benchmarks, the larger Llama model is just as susceptible to the choice of delimiter, with a fluctuation of 40% on CommonsenseQA (an even larger change than for the smaller model).
  • Figure 5: The effect of the delimiter ("[space]" or "\n") on the in-context learning performance of Llama-3.1-8B-instruct and Qwen2.5-7B-instruct. The delimiter dramatically changes in-context learning performance regardless of the model or the number of demonstrations.
  • ...and 2 more figures