BERTs are Generative In-Context Learners

David Samuel

TL;DR

This work enables an existing masked model, DeBERTa, to perform generative tasks without additional training or architectural changes, and reveals that masked and causal language models behave very differently: each clearly outperforms the other on different categories of tasks.

Abstract

While in-context learning is commonly associated with causal language models, such as GPT, we demonstrate that this capability also 'emerges' in masked language models. Through an embarrassingly simple inference technique, we enable an existing masked model, DeBERTa, to perform generative tasks without additional training or architectural changes. Our evaluation reveals that masked and causal language models behave very differently, as each clearly outperforms the other on different categories of tasks. These complementary strengths suggest that the field's focus on causal models for in-context learning may be limiting: both architectures can develop these capabilities, but with distinct advantages, which points toward promising hybrid approaches that combine the strengths of both objectives.

Paper Structure (66 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: The average 1-shot performance across four groups of NLP tasks. We compare the scaling abilities of DeBERTa (four sizes in red) with GPT-3 (eight sizes in blue). Even though these models rely on different training objectives, they scale in a similar log-linear manner overall. Yet, on a task-by-task basis, the pretraining methods lead to substantial differences between them.
  • Figure 2: Illustration of the proposed methods for using a masked language model for text generation and text ranking. We show how to adapt a masked language model for in-context-learning tasks through simple input reformatting, requiring no additional training. Left: Text generation is achieved by 1) appending [MASK] tokens to the input prompt, 2) predicting the next token for the first mask, and 3) iteratively appending new masks and predicting tokens. Right: A similar approach is used to retrieve a pseudo-log-likelihood score of a text sequence that can be used to rank multiple sequences by their individual likelihoods. Both methods maintain the model's original architecture while enabling new capabilities through careful input formatting.
  • Figure 3: Length generalization measured with a 'needle in a haystack' benchmark. The $x$-axis indicates the total size of the 'haystack' and the $y$-axis indicates the position of the 'needle'; the values show the average exact-match accuracy for a particular configuration. Because GPT-3 is a closed-source model and the original version is not accessible, we use OPT (Zhang et al., 2022), an open-source replication of GPT-3, which should perform similarly on this task because it shares the same transformer architecture as GPT-3. In particular, it uses absolute positional encoding, which strictly prevents any model from generalizing to inputs longer than those it was trained on.
  • Figure 4: The performance improvement with an increased number of in-context examples. We compare the in-context learning ability of 1.5B DeBERTa (in red) with 1.3B GPT-3 (in blue) using prompts without any completed examples (0-shot), prompts with a single randomly sampled gold sample (1-shot), and prompts with a few examples (4 to 64 examples, depending on the task). This figure demonstrates that a masked language model behaves similarly to a causal language model in the in-context learning regime. A more detailed few-shot evaluation is in \ref{fig:glue-shots}.
  • Figure 5: The average performance on the SuperGLUE benchmarks as a function of the number of shots. As opposed to the other SuperGLUE few-shot results, where we select the number of shots for each subtask according to the performance on its training split, here all subtasks are evaluated with the same number of shots. In this way, we can compare DeBERTa 1.5B directly to Figure 3.8 of Brown et al. (2020), which gives the same evaluation for GPT-3 175B (unfortunately not for smaller, more comparable models).
  • ...and 1 more figure
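
The two procedures described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `MaskedLM` callable interface, the `toy_model` lookup table, the `num_masks` and `eos` parameters, and the choice to mask exactly one candidate position at a time when scoring are all hypothetical choices made for this example. A real setup would replace `toy_model` with a call into an actual masked language model such as DeBERTa.

```python
import math
from typing import Callable, Dict, List

MASK = "[MASK]"

# Hypothetical interface: given a token sequence and a position that holds
# a mask, return a probability distribution over tokens for that position.
MaskedLM = Callable[[List[str], int], Dict[str, float]]


def generate(model: MaskedLM, prompt: List[str], max_new_tokens: int,
             num_masks: int = 2, eos: str = "[EOS]") -> List[str]:
    """Greedy left-to-right generation with a masked language model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # 1) Append mask tokens to the current text.
        padded = tokens + [MASK] * num_masks
        # 2) Predict a token for the first mask only.
        probs = model(padded, len(tokens))
        next_token = max(probs, key=probs.get)
        if next_token == eos:
            break
        # 3) Keep the prediction; fresh masks are appended on the next pass.
        tokens.append(next_token)
    return tokens[len(prompt):]


def pseudo_log_likelihood(model: MaskedLM, context: List[str],
                          candidate: List[str]) -> float:
    """Sum of log-probabilities of each candidate token when it alone is masked."""
    sequence = context + candidate
    total = 0.0
    for offset, token in enumerate(candidate):
        position = len(context) + offset
        masked = sequence[:position] + [MASK] + sequence[position + 1:]
        probs = model(masked, position)
        # Tokens missing from the distribution get a tiny floor probability.
        total += math.log(probs.get(token, 1e-12))
    return total


def toy_model(tokens: List[str], position: int) -> Dict[str, float]:
    """Stand-in for a real masked LM: looks only at the left neighbour."""
    table = {
        "the": {"cat": 0.7, "dog": 0.2, "[EOS]": 0.1},
        "cat": {"sat": 0.8, "[EOS]": 0.2},
        "sat": {"[EOS]": 0.9, "cat": 0.1},
    }
    left = tokens[position - 1] if position > 0 else ""
    return table.get(left, {"[EOS]": 1.0})


generated = generate(toy_model, ["the"], max_new_tokens=5)
score_cat = pseudo_log_likelihood(toy_model, ["the"], ["cat", "sat"])
score_dog = pseudo_log_likelihood(toy_model, ["the"], ["dog", "sat"])
best = max([["cat", "sat"], ["dog", "sat"]],
           key=lambda c: pseudo_log_likelihood(toy_model, ["the"], c))
```

Ranking then amounts to computing the score for every candidate completion and picking the highest one, as the last line does. Neither function touches the model's weights or architecture; only the input formatting changes between calls.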