
Logistic Regression makes small LLMs strong and explainable "tens-of-shot" classifiers

Marcus Buckmann, Edward Hill

TL;DR

We show that penalised logistic regression on the embeddings from a small LLM matches the performance of a large LLM in the "tens-of-shot" regime, and we extract stable and sensible explanations for its classification decisions.

Abstract

For simple classification tasks, we show that users can benefit from the advantages of using small, local, generative language models instead of large commercial models without a trade-off in performance or introducing extra labelling costs. These advantages, including those around privacy, availability, cost, and explainability, are important both in commercial applications and in the broader democratisation of AI. Through experiments on 17 sentence classification tasks (2-4 classes), we show that penalised logistic regression on the embeddings from a small LLM equals (and usually betters) the performance of a large LLM in the "tens-of-shot" regime. This requires no more labelled instances than are needed to validate the performance of the large LLM. Finally, we extract stable and sensible explanations for classification decisions.
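The core recipe in the abstract, penalised logistic regression fitted on a few tens of labelled embeddings, can be sketched in a few lines. This is only an illustrative sketch and not the authors' implementation: the embeddings are synthetic Gaussian clusters standing in for LLM embeddings, the penalty is assumed to be L2, and the helper names (`fit_penalised_logreg`, `toy_embeddings`) and all hyperparameters are hypothetical. In practice one would more likely use a library implementation on real embeddings.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_penalised_logreg(X, y, l2=0.1, lr=0.2, epochs=300):
    """Binary L2-penalised logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Mean log-loss gradient plus the L2 penalty term (bias is not penalised).
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    # Predict class 1 when the linear score is positive.
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]

def toy_embeddings(n_per_class, dim, rng):
    # Hypothetical stand-in for LLM embeddings: two Gaussian clusters.
    X, y = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = toy_embeddings(20, 8, rng)   # "tens of shots" per class
X_test, y_test = toy_embeddings(100, 8, rng)
w, b = fit_penalised_logreg(X_train, y_train)
preds = predict(X_test, w, b)
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the model is linear in the embedding, each fitted weight in `w` can be read directly as that dimension's contribution to the decision, which is one reason a linear classifier is a natural starting point for explanations.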

Paper Structure (39 sections, 3 equations, 24 figures, 4 tables)

Figures (24)

  • Figure 1: Left: Comparing the accuracies from GPT-4 and our method (PLR-E) over the 17 classification tasks. Right: We show, for increasing training sample sizes (divided by the number of classes), for how many datasets our method outperforms GPT-4 on accuracy, and vice versa (solid lines). We also count in how many of these cases we can confidently declare a 'winner' above statistical noise (significance level: $\alpha = 0.1$, two-sided).
  • Figure 2: The accuracies of the zero-shot next token text predictions from GPT-4 and Llama2-7B, along with the learning curves for the PLR-L and PLR-E methods applied to our baseline model (Llama2-7B q4.0).
  • Figure 3: Comparing GPT-4 to our baseline model. From left, using the baseline model's next token prediction, learning from its logits (PLR-L), and learning from its embeddings (PLR-E).
  • Figure 4: The accuracy of PLR-E when trained on embeddings from different models and promptings, cf. Figure \ref{fig_baseline_learning}.
  • Figure 5: Instruction prompting. Top panel: Comparing the performance when using PLR-E on our baseline model with and without surrounding instructions. Middle and bottom panels: Comparing our baseline PLR-E (with instructions) against applying PLR-E to the embeddings from two sentence embedding models with (red arrows) and without instructions (black crosses).
  • ...and 19 more figures