Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements

Anton Voronov, Lena Wolf, Max Ryabinin

TL;DR

This paper reveals that prompt template choice is a major determinant of in-context learning performance and does not transfer well across models, tasks, or prompting methods. It conducts a broad, multi-model, multi-dataset analysis showing no universal best template and substantial cross-setup variability. To address this, it introduces Template Ensembles, a simple test-time augmentation that averages predictions across templates, yielding higher mean accuracy and reduced variance. The work highlights the need for robust, multi-template evaluation in ICL research and provides a practical path toward more reliable comparisons of prompting methods.

Abstract

Large language models demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, or the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on in-context learning performance. We evaluate the impact of the prompt template across 21 models (from 770M to 70B parameters) and 4 standard classification datasets. We show that a poor choice of template can reduce the performance of the strongest models and inference methods to the level of random guessing. More importantly, the best templates do not transfer between different setups, or even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results because different works use different templates. As a first step towards mitigating this issue, we propose Template Ensembles, which aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while being robust to the choice of a random set of templates.
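
The idea of aggregating predictions across templates can be illustrated with a minimal sketch. It assumes that each template yields one vector of class probabilities for the same test input and that aggregation is a plain unweighted mean; the function name `ensemble_predict` and the input layout are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def ensemble_predict(
    probs_per_template: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Average class probabilities over templates and pick the argmax class.

    probs_per_template[t][c] is the probability of class c obtained when the
    prompt is formatted with template t (hypothetical layout, for illustration).
    """
    if not probs_per_template:
        raise ValueError("at least one template is required")
    n_classes = len(probs_per_template[0])
    if any(len(p) != n_classes for p in probs_per_template):
        raise ValueError("all templates must score the same set of classes")

    n_templates = len(probs_per_template)
    # Unweighted mean over templates, computed per class (an assumption).
    mean_probs = [
        sum(p[c] for p in probs_per_template) / n_templates
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=mean_probs.__getitem__)
    return predicted, mean_probs


# Three templates disagree on a binary task; the ensemble follows the mean.
label, mean_probs = ensemble_predict([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(label, mean_probs)
```

Averaging probabilities rather than taking a majority vote keeps each template's confidence in play, but either choice is consistent with the description above.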

Paper Structure (37 sections, 10 figures, 20 tables)

Figures (10)

  • Figure 1: An example template transformation for two demonstrations. Different prompt formats lead to different rankings both for models and ICL methods, and the best template for one method can be suboptimal for others.
  • Figure 2: Comparison of in-context learning prediction methods in the 2-shot setting.
  • Figure 3: Comparison of the selection methods in the Direct 4-shot setting. For the evaluation results of other models and datasets, please refer to \ref{app:selection_methods_full}.
  • Figure 4: IoU of top-10 templates for all base models with 2 random demonstrations and the Direct prediction method on the DBPedia dataset.
  • Figure 5: IoU of 10 best templates for example selection methods on the AG News dataset. Method-N indicates that Method was used to select $N$ examples.
  • ...and 5 more figures
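
The IoU of top-10 templates reported in Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions: each setup (a model or a selection method) is represented by a mapping from a template identifier to its accuracy, and IoU is the standard intersection-over-union of the two top-k sets. The function name `top_k_iou` and the data layout are hypothetical.

```python
from typing import Dict, Set


def top_k_templates(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the identifiers of the k highest-scoring templates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_k_iou(scores_a: Dict[str, float], scores_b: Dict[str, float], k: int = 10) -> float:
    """Intersection-over-union of the top-k template sets of two setups."""
    top_a = top_k_templates(scores_a, k)
    top_b = top_k_templates(scores_b, k)
    union = top_a | top_b
    if not union:
        return 0.0  # no templates scored in either setup
    return len(top_a & top_b) / len(union)


# Toy accuracies for three templates under two hypothetical setups.
setup_a = {"t1": 0.9, "t2": 0.8, "t3": 0.1}
setup_b = {"t1": 0.2, "t2": 0.9, "t3": 0.8}
print(top_k_iou(setup_a, setup_b, k=2))
```

A value near 1 means the two setups agree on which templates are best, while a value near 0 means their best templates barely overlap, which is the kind of poor transfer the abstract describes.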