
Similarity-Distance-Magnitude Language Models

Allen Schmaltz

TL;DR

This work introduces Similarity-Distance-Magnitude (SDM) language models, which fine-tune decoder-only LMs to maximize the fraction of generations that fall inside a well-calibrated, high-probability region defined by a final-layer SDM activation. The approach combines a document-level binary classifier at training time with a contrastive masking encoding scheme and online generation of hard negatives to shape the next-token loss, thereby reducing abstentions and improving statistical efficiency. Key innovations include (i) a final-layer SDM activation used for both test-time classification and training-time loss weighting, (ii) a contrastive masking encoding scheme, and (iii) online hard-negative generation to expose the model to challenging, instruction-violating completions. Experiments on a word-ordering task with a 3.8B decoder show that SDM fine-tuning significantly increases the proportion of in-distribution generations within the high-probability region, with robust behavior under distributional shifts and only modest changes to marginal accuracy, demonstrating practical gains for uncertainty-aware selective generation in large LMs.

Abstract

We introduce Similarity-Distance-Magnitude (SDM) language models (LMs), which are sequence prediction models fine-tuned to maximize the proportion of generations in the well-calibrated, high-probability region partitioned by a final-layer SDM activation layer used for binary classification of instruction-following. We demonstrate that existing pre-trained decoder-only Transformer LMs can be readily converted into SDM LMs via supervised fine-tuning, using the final-layer SDM activation layer during training to estimate a change-of-base for a supervised next-token loss over a contrastive input encoding scheme, with additional hard negative examples generated online during training. This results in reduced abstentions (i.e., improved statistical efficiency) compared to strong supervised baselines.
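
To make "reduced abstentions" concrete, the following is a minimal sketch of selective generation with a probability threshold. It is not the SDM activation itself, which the abstract does not specify. The `Generation` class, the `p_follows_instruction` field, and the threshold `alpha = 0.95` are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Generation:
    text: str
    # Hypothetical: a classifier's estimated probability that the
    # generation follows the instruction, in [0, 1].
    p_follows_instruction: float


def select_or_abstain(g: Generation, alpha: float = 0.95) -> Optional[str]:
    """Return the text if it lies in the high-probability region, else abstain (None)."""
    return g.text if g.p_follows_instruction >= alpha else None


def admitted_fraction(generations: List[Generation], alpha: float = 0.95) -> float:
    """Fraction of generations that are not abstained on at threshold alpha."""
    if not generations:
        return 0.0
    admitted = sum(1 for g in generations if g.p_follows_instruction >= alpha)
    return admitted / len(generations)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    gens = [
        Generation("a", 0.99),
        Generation("b", 0.50),
        Generation("c", 0.96),
        Generation("d", 0.20),
    ]
    print(admitted_fraction(gens))
```

Under this reading, a model with a higher admitted fraction at the same threshold abstains less often, which is the sense in which the abstract equates reduced abstentions with improved statistical efficiency.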

Paper Structure

This paper contains 22 sections, 1 equation, 5 tables.