Comparing Template-based and Template-free Language Model Probing

Sagi Shaier, Kevin Bennett, Lawrence E Hunter, Katharina von der Wense

TL;DR

Template-free and template-based approaches often rank models differently, except for the top domain-specific models. Models also tend to predict the same answers across prompts in template-based probing, which is less common when employing template-free techniques.

Abstract

The differences between cloze-task language model (LM) probing with 1) expert-made templates and 2) naturally-occurring text have often been overlooked. Here, we evaluate 16 different LMs on 10 English probing datasets -- 4 template-based and 6 template-free -- in general and biomedical domains to answer the following research questions: (RQ1) Do model rankings differ between the two approaches? (RQ2) Do models' absolute scores differ between the two approaches? (RQ3) Do the answers to RQ1 and RQ2 differ between general and domain-specific models? Our findings are: 1) Template-free and template-based approaches often rank models differently, except for the top domain-specific models. 2) Scores decrease by up to 42% Acc@1 when comparing parallel template-free and template-based prompts. 3) Perplexity is negatively correlated with accuracy in the template-free approach, but, counter-intuitively, they are positively correlated for template-based probing. 4) Models tend to predict the same answers frequently across prompts for template-based probing, which is less common when employing template-free techniques.

Paper Structure (43 sections, 1 equation, 2 figures, 6 tables)


Figures (2)

  • Figure 1: Template-free vs. template-based: We evaluate the percentage of times each entity appears in the top 10 predictions for each prompt. We show the results for the top 15 most frequent entities. Next to each model's name we also add the percentage of unique entities it predicts over all prompts for top 1, 5, and 10.
  • Figure 2: Template-free vs. template-based: Average Acc@1 vs. average perplexity per model, over datasets. Template-based Pearson’s correlation coefficient: 0.83, p-value=0.16. Template-free Pearson’s correlation coefficient: 0.60, p-value=0.20.
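
The statistics described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the input layout (one ranked list of predicted entities per prompt) are assumptions, and the "percentage of unique entities" is assumed here to mean distinct predicted entities divided by the total number of top-k prediction slots, which the caption does not spell out.

```python
from collections import Counter


def top_k_entity_frequencies(predictions, k=10, n_most_frequent=15):
    """Percentage of prompts in which each entity appears in the top-k predictions.

    `predictions` is a hypothetical input: one ranked list of predicted
    entities per prompt, best prediction first.
    """
    n_prompts = len(predictions)
    counts = Counter()
    for ranked in predictions:
        # Count an entity at most once per prompt.
        counts.update(set(ranked[:k]))
    return [
        (entity, 100.0 * count / n_prompts)
        for entity, count in counts.most_common(n_most_frequent)
    ]


def unique_entity_percentage(predictions, k):
    """Distinct entities among all top-k predictions, as a percentage of all
    top-k prediction slots (an assumed reading of the caption)."""
    total = sum(len(ranked[:k]) for ranked in predictions)
    unique = len({entity for ranked in predictions for entity in ranked[:k]})
    return 100.0 * unique / total if total else 0.0


# Toy data: three prompts, three ranked predictions each.
toy_predictions = [
    ["aspirin", "insulin", "heparin"],
    ["aspirin", "warfarin", "statin"],
    ["aspirin", "insulin", "morphine"],
]
frequencies = top_k_entity_frequencies(toy_predictions, k=3, n_most_frequent=2)
unique_top1 = unique_entity_percentage(toy_predictions, k=1)
unique_top3 = unique_entity_percentage(toy_predictions, k=3)
```

Under this reading, a model that repeats the same answer for every prompt gets a low unique-entity percentage at top 1, which is the behavior the abstract reports as more frequent for template-based probing.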