Logistic Regression makes small LLMs strong and explainable "tens-of-shot" classifiers

Marcus Buckmann; Edward Hill

TL;DR

We show that penalised logistic regression on the embeddings from a small LLM equals the performance of a large LLM in the "tens-of-shot" regime, and we extract stable and sensible explanations for classification decisions.

Abstract

For simple classification tasks, we show that users can benefit from the advantages of using small, local, generative language models instead of large commercial models without a trade-off in performance or introducing extra labelling costs. These advantages, including those around privacy, availability, cost, and explainability, are important both in commercial applications and in the broader democratisation of AI. Through experiments on 17 sentence classification tasks (2-4 classes), we show that penalised logistic regression on the embeddings from a small LLM equals (and usually betters) the performance of a large LLM in the "tens-of-shot" regime. This requires no more labelled instances than are needed to validate the performance of the large LLM. Finally, we extract stable and sensible explanations for classification decisions.
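
To make the core idea concrete, the following is a minimal sketch of penalised logistic regression fitted on a few tens of embedding vectors. It is not the authors' implementation: the embeddings are synthetic Gaussian clusters standing in for vectors from a small LLM, the penalty is assumed to be L2 (the penalty type is not stated in this section), the task is binary rather than 2-4 classes, and all function names (`fit_plr`, `predict`, `make_fake_embeddings`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_plr(X, y, l2=1.0, lr=0.1, steps=500):
    """Fit binary L2-penalised logistic regression by gradient descent.

    X: list of embedding vectors, y: list of 0/1 labels.
    Objective: (sum of log-losses + l2 / 2 * ||w||^2) / n.
    The bias term is not penalised.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Gradient step: data term plus the L2 penalty on the weights.
        w = [wj - lr * (gj + l2 * wj) / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    # Predict class 1 when the fitted probability is at least 0.5.
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)
        for xi in X
    ]


def make_fake_embeddings(n_per_class, dim, rng):
    # ASSUMPTION: stand-in for LLM embeddings, two shifted Gaussian clusters.
    X, y = [], []
    for label in (0, 1):
        shift = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            X.append([rng.gauss(shift, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


rng = random.Random(0)
# "Tens-of-shot": 20 labelled examples per class for training.
X_train, y_train = make_fake_embeddings(n_per_class=20, dim=16, rng=rng)
X_test, y_test = make_fake_embeddings(n_per_class=200, dim=16, rng=rng)

w, b = fit_plr(X_train, y_train, l2=1.0)
preds = predict(X_test, w, b)
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.3f}")
```

In a real pipeline, `X_train` would hold embeddings extracted from the small LLM for each prompted sentence, and the hand-written fit would likely be replaced by a library routine with the penalty strength chosen by cross-validation.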

Paper Structure (39 sections, 3 equations, 24 figures, 4 tables)

Introduction
Literature review
Methodology
Prompt construction
Embedding
Text prediction
Penalised logistic regression
Implementation
Performance comparison: Small training set, large test set
Relating next token prediction, logits, and embeddings
Robustness: Omitting instructions from the prompt
Robustness: Choice of prefix and suffix
Robustness: Model size and quantization
Robustness: In-context few-shot learning
Performance comparison: "tens-of-shot" labelled data -- small training and test sets
...and 24 more sections

Figures (24)

Figure 1: Left: Comparing the accuracies of GPT-4 and our method (PLR-E) over the 17 classification tasks. Right: For increasing training sample sizes (divided by the number of classes), we show for how many datasets our method outperforms GPT-4 on accuracy, and vice versa (solid lines). We also count in how many of these cases we can confidently declare a 'winner' above statistical noise (significance level: $\alpha = 0.1$, two-sided).
Figure 2: The accuracies of the zero-shot next token text predictions from GPT-4 and Llama2-7B, along with the learning curves for the PLR-L and PLR-E methods applied to our baseline model (Llama2-7B q4.0).
Figure 3: Comparing GPT-4 to our baseline model. From left, using the baseline model's next token prediction, learning from its logits (PLR-L), and learning from its embeddings (PLR-E).
Figure 4: The accuracy of PLR-E when trained on embeddings from different models and promptings, cf. Figure \ref{fig_baseline_learning}.
Figure 5: Instruction prompting. Top panel: Comparing the performance when using PLR-E on our baseline model with and without surrounding instructions. Middle and bottom panels: Comparing our baseline PLR-E (with instructions) against applying PLR-E to the embeddings from two sentence embedding models with (red arrows) and without instructions (black crosses).
...and 19 more figures
