Order-Independence Without Fine Tuning

Reid McIlroy-Young; Katrina Brown; Conlan Olson; Linjun Zhang; Cynthia Dwork

TL;DR

Set-Based Prompting provably eliminates order dependency and can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings.

Abstract

The development of generative language models that can create long and coherent textual outputs via autoregression has led to a proliferation of uses and a corresponding sweep of analyses as researchers work to determine the limitations of this new paradigm. Unlike humans, these 'Large Language Models' (LLMs) are highly sensitive to small changes in their inputs, leading to unwanted inconsistency in their behavior. One problematic inconsistency when LLMs are used to answer multiple-choice questions or analyze multiple inputs is order dependency: the output of an LLM can (and often does) change significantly when sub-sequences are swapped, despite both orderings being semantically identical. In this paper we present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably eliminates order dependency, and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of our method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, and usually significantly less in practice, where the expectation is taken over a uniformly random shuffling of the candidate responses. Thus, Set-Based Prompting can be used as a 'dropped-in' method on fully trained models. Finally, we discuss how our method's success suggests that other strong guarantees can be obtained on LLM performance via modifying the input representations.
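The core idea can be illustrated with a toy attention stack. The following is a minimal sketch, not the authors' implementation: it assumes additive learned position embeddings, a single attention head with shared weights across two layers, and a suffix whose positions start after the longest sub-sequence. The names `build_inputs`, `set_based_mask`, and `last_token_output` are hypothetical and exist only for this illustration. Parallel sub-sequences share a starting position and are masked from one another, so swapping them should leave the final token's representation unchanged up to floating-point error.

```python
import numpy as np

def build_inputs(prefix, subseqs, suffix):
    """Return token ids, position ids, and group ids (-1 = shared text, k = sub-sequence k)."""
    tokens, positions, groups = [], [], []
    for t in prefix:
        tokens.append(t); positions.append(len(positions)); groups.append(-1)
    start = len(prefix)
    for k, seq in enumerate(subseqs):
        for offset, t in enumerate(seq):
            # Every sub-sequence restarts at the same position (assumption of this sketch).
            tokens.append(t); positions.append(start + offset); groups.append(k)
    after = start + max(len(s) for s in subseqs)
    for offset, t in enumerate(suffix):
        tokens.append(t); positions.append(after + offset); groups.append(-1)
    return np.array(tokens), np.array(positions), np.array(groups)

def set_based_mask(groups):
    """Causal mask in which tokens of different sub-sequences cannot see each other."""
    n = len(groups)
    causal = np.tril(np.ones((n, n), dtype=bool))
    g_i, g_j = groups[:, None], groups[None, :]
    visible = (g_i == -1) | (g_j == -1) | (g_i == g_j)
    return causal & visible

def last_token_output(tokens, positions, mask, params, n_layers=2):
    """Run a toy residual attention stack and return the final token's vector."""
    emb, pos_emb, wq, wk, wv = params
    x = emb[tokens] + pos_emb[positions]
    for _ in range(n_layers):
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = np.where(mask, q @ k.T / np.sqrt(x.shape[1]), -np.inf)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        x = x + weights @ v
    return x[-1]

rng = np.random.default_rng(0)
d, vocab, max_pos = 8, 50, 32
params = (rng.normal(size=(vocab, d)), rng.normal(size=(max_pos, d)),
          *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
prefix, suffix = [1, 2, 3], [9, 10]
options = [[11, 12], [13, 14, 15], [16]]  # made-up token ids

def run(subseqs, use_set_mask):
    tokens, positions, groups = build_inputs(prefix, subseqs, suffix)
    if use_set_mask:
        mask = set_based_mask(groups)
    else:
        # Ordinary prompting: sequential positions and a plain causal mask.
        positions = np.arange(len(tokens))
        mask = np.tril(np.ones((len(tokens), len(tokens)), dtype=bool))
    return last_token_output(tokens, positions, mask, params)

set_a, set_b = run(options, True), run(options[::-1], True)
base_a, base_b = run(options, False), run(options[::-1], False)
print("set-based outputs match:", np.allclose(set_a, set_b))
print("baseline outputs match:", np.allclose(base_a, base_b))
```

In this sketch the invariance follows because each sub-sequence token sees only the prefix and its own earlier tokens, while later tokens aggregate over the sub-sequences as an unordered set.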

Paper Structure (28 sections, 2 theorems, 23 equations, 11 figures, 1 table)


Introduction
Related Works
Set-Based Prompting
Background
Provable Order-Independence via Set-Based Prompting
Methodology and Theoretical Guarantees
Attention Mask
Performance
CommonSenseQA
Measuring Massive Multitask Language Understanding
Ablations
Other Tests
Discussion
Further Explorations
Towards Metadata in LLM Inputs
...and 13 more sections

Key Result

Theorem 1

Given $\bm{M}_{i,j}^{k,f}$ as in Equation eq:new_mask, fix any permutation function $\tau$ on the indices $1,\dots,\ell$ of the sub-queries $S = \left\{s_1,\dots, s_k,\dots,s_\ell\right\}$ for the attention mechanism, so that applying $\tau$ to the blocks of column vectors corresponding to the $\ell$ and ...

Figures (11)

Figure 1: Illustration of order dependency in Llama 2, 7B. Using the order provided by Measuring Massive Multitask Language Understanding (MMLU) [hendrycks2021measuring], Llama 2, 7B gets the question correct, as seen in a), but if the order is reversed for the question, Llama 2, 7B predicts the wrong answer. In c) we use Set-Based Prompting to remove the ordering of the answers, and Llama 2, 7B once again gets the question correct.
Figure 2: Visualization of the differences between order-dependent prompting (left) and Set-Based Prompting (right). Our input is the prompt 'the aptly quick light reddy brown fox', where 'aptly quick' is in parallel to 'light reddy brown'. Each row represents a query to an attention block (we treat each word as a token), with the index of the query given by $i$. $\bm{X}$ and $\bm{X}_{s}$ give the set of values over which the query is attending. $\bm{p}(i,j)$ is the vector-valued positional encoding which is added to the word's embedding. The center of the diagram is the attention mask $\bm{M}_{j,i}$.
Figure 3: Per-model accuracy on two different datasets. Blue bars (left three) indicate runs done without our method, and green bars indicate runs with Set-Based Prompting. The blue bars are constructed by running the test twice, once with the normal ordering and once with the reversed ordering. Worst of 2 and Best of 2 count when both orderings lead to a correct answer or only one ordering answered correctly, respectively, while Best of 1 indicates that the normal ordering led to a correct answer. As Set-Based Prompting is invariant to reordering, we only show one bar for all orderings.
Figure 4: MMLU results for a subset of models across all possible permutations ($4!$) of the ordering of the options, with the accuracy under Set-Based Prompting indicated with a diamond. Dots are ordered by accuracy within each model's results; boxes show the quartiles across the different orderings.
Figure 5: Accuracy per model on MMLU, with error bars showing the variation in accuracy under two orderings. The conditions are unmodified model, only the positional encoding $\bm{p}(i,j)$ modified, only the attention mask $\bm{M}_{i,j}^{k,f}$ modified, and Set-Based Prompting.
...and 6 more figures
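
The two-ordering metrics named in the Figure 3 caption can be sketched as follows. This is an illustrative reading rather than the authors' evaluation code: it assumes Best of 2 credits a question when at least one of the two orderings is answered correctly, and the function name `two_order_metrics` and the sample correctness flags are made up for the example. The last line enumerates the $4!$ option orderings referred to in the Figure 4 caption.

```python
from itertools import permutations

def two_order_metrics(normal_correct, reversed_correct):
    """Accuracy summaries over paired runs (normal and reversed option order)."""
    pairs = list(zip(normal_correct, reversed_correct))
    n = len(pairs)
    return {
        # Normal ordering only.
        "best_of_1": sum(a for a, _ in pairs) / n,
        # Assumed reading: at least one ordering is correct.
        "best_of_2": sum(a or b for a, b in pairs) / n,
        # Both orderings are correct.
        "worst_of_2": sum(a and b for a, b in pairs) / n,
    }

# Hypothetical per-question correctness flags for four questions.
metrics = two_order_metrics([True, True, False, False],
                            [True, False, True, False])
print(metrics)

# All orderings of four answer options, as in a full-permutation evaluation.
orderings = list(permutations("ABCD"))
print(len(orderings))
```

An order-independent method needs only one accuracy value per model, since all of these orderings produce the same output.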

Theorems & Definitions (2)

Theorem 1
Theorem 2
