Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers

Davis Yoshida

TL;DR

This work investigates how to maximize the utility of pretrained transformers by introducing methods that extend their effective context and generation capabilities without full-scale retraining. It presents three core finetuning/inference strategies: (1) Adding Recurrence to enable longer context and reduce compute in decoders, (2) Hidden-State Optimization (HSO) to improve inference-time predictions without updating weights, and (3) RUM-SUNDAE, a method to convert masked language models into non-autoregressive encoder–decoders for faster MT and code translation. The thesis further analyzes the longstanding bad-mode problem in NLG, offering nuanced theoretical and empirical insights and introducing conditional beam search to find high-likelihood, high-quality outputs under controlled constraints. Together, these contributions broaden the practical applicability of existing pretrained models across languages and domains, offering more efficient, controllable, and robust generation without extensive retraining. The work also surveys relevant decoding strategies, data distributions, and evaluation notions to better align model likelihood with output quality in real-world settings.

Abstract

This thesis provides methods and analysis that make progress on the goal of getting the most out of pretrained transformer language models (LMs). The techniques outlined are task-agnostic and should provide benefit when used with nearly any transformer LM. We introduce two new finetuning methods that add new capabilities to the models they are applied to. The first adds a recurrence mechanism, which removes the fixed window size constraint and improves the efficiency of a transformer decoder. The second allows masked language models (MLMs) to be used to initialize both the encoder and the decoder of a non-autoregressive sequence-to-sequence transformer, opening up generative applications of models that were previously used only for natural language understanding tasks. We also introduce two new techniques for improving the quality of predictions of any transformer decoder without additional finetuning. One, hidden-state optimization, can be applied to any transformer decoder to improve the quality of predictions at inference time, especially for few-shot classification. The other, conditional beam search, allows practitioners to search for natural language generation (NLG) model outputs with high likelihood while conditioning on the event that the output is not degenerate (e.g., empty or repetitive). Finally, we provide theoretical and empirical insights into the divergence between model likelihood and output quality that has been widely observed in prior work. These insights apply to any model that represents a distribution over text, including language models that are not transformers or not even autoregressive. We argue that the NLP community has, to some extent, misunderstood the implications of these findings, and we encourage a more nuanced point of view.

Paper Structure (220 sections, 39 equations, 12 figures, 26 tables, 1 algorithm)

Introduction
LMs and NLG: Background and related work
Language modeling, transformers, and pretraining
Three families of pretrained models
Coherent text generation
Few and Zero-shot learning/Prompting
The current generation of LMs: Training for instruction following
Improving the efficiency of transformers
Efficient finetuning
Reducing memory requirements of pretrained models
Modifying and controlling generation
Decoding strategies
Specializing models via finetuning
Controllable generation
Contributions
...and 205 more sections

Figures (12)

Figure 1: Augmenting a pretrained transformer with a recurrence module, allowing reduction of attention computation as well as simpler processing of longer contexts.
Figure 2: $\bm{h}_{\text{prev}}$ is added as an additional key and value to one self-attention layer. Arrows show which positions can pass information to which other positions.
Figure 3: Varying degree of overlap while evaluating a transformer with a window size of 3. The green (top) circles are outputs, and the blue (bottom) circles are inputs.
Figure 4: Effect of window size on performance on PG-19 validation set.
Figure 5: Relationship between FLOPs and perplexity for recurrent and non-recurrent models. Curves range over window sizes from 200 to 600.
...and 7 more figures
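
The caption of Figure 3 describes evaluating a fixed-window transformer with a varying degree of overlap between consecutive windows. A minimal Python sketch of that bookkeeping follows. The function name `sliding_windows`, the tuple layout, and the rule that each token is scored exactly once (by the first window that reaches it) are illustrative assumptions, not the thesis's evaluation code.

```python
def sliding_windows(seq_len, window, overlap):
    """Plan evaluation windows over a sequence of seq_len tokens.

    Returns a list of (start, end, first_scored) tuples. Positions
    [start, end) are fed to the model; only positions
    [first_scored, end) are scored, so the overlapping prefix of each
    window serves purely as context. (Hypothetical helper for
    illustration; the exact protocol in the thesis may differ.)
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    stride = window - overlap  # how far each new window advances
    windows = []
    start = 0
    scored_up_to = 0  # all positions before this index are already scored
    while scored_up_to < seq_len:
        end = min(start + window, seq_len)
        windows.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return windows


def tokens_processed(windows):
    """Total number of input positions fed to the model across windows."""
    return sum(end - start for start, end, _ in windows)


if __name__ == "__main__":
    for overlap in (0, 1, 2):
        plan = sliding_windows(seq_len=7, window=3, overlap=overlap)
        print(overlap, plan, tokens_processed(plan))
```

Under these assumptions, a window size of 3 over 7 tokens needs 3 windows with no overlap and 5 windows with an overlap of 2. Larger overlap gives each scored token more preceding context, at the cost of feeding more positions through the model.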
