Concept Training for Human-Aligned Language Models

Christine Zhang, Dan Jurafsky, Chen Shani

Abstract

The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence "this website is safe to _browse_" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (defined in Section 3.1) and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.
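To make the objective concrete, below is a minimal sketch of one way a concept-level target could be mixed with the standard one-hot NTP target. This is an illustrative assumption, not the paper's exact implementation: the function name, the uniform distribution over each concept set, and the target-level interpolation are ours; only the idea of a tunable concept weight spanning pure NTP (weight 0) to pure concept supervision (weight 1) comes from the paper.

import torch
import torch.nn.functional as F

def concept_ntp_loss(logits, target_ids, concept_sets, concept_weight=0.5):
    # Hypothetical mixed objective: interpolate the one-hot NTP target
    # with a soft distribution over each target's concept set (a set of
    # semantically related token ids). concept_weight = 0 recovers pure
    # NTP training; concept_weight = 1 is pure concept supervision.
    vocab_size = logits.size(-1)
    ntp_target = F.one_hot(target_ids, vocab_size).float()
    concept_target = torch.zeros_like(ntp_target)
    for i, related in enumerate(concept_sets):
        # Spread probability mass uniformly over the related tokens
        # (uniform weighting is our assumption, not the paper's spec).
        concept_target[i, related] = 1.0 / related.numel()
    mixed = (1 - concept_weight) * ntp_target + concept_weight * concept_target
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross-entropy against the mixed (soft) target distribution.
    return -(mixed * log_probs).sum(dim=-1).mean()

Because the interpolation happens at the target level, the loss remains an ordinary cross-entropy against a soft label distribution, so existing NTP training code can be reused with only the labels changed.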


Figures (16)

  • Figure 1: Concept Training Pipeline. (a) depicts the data-processing phase that enriches the training data with a concept signal, and (b) shows the modification of the loss function.
  • Figure 2: We post-trained using two datasets, three base models, and five concept weights. Baselines (the base pretrained model with no post-training, and concept weight 0, i.e., pure NTP training) are shown in dashed black. All model variants with their performance are in the appendix.
  • Figure 3: Models with higher levels of concept awareness align more closely with human semantic intuition on MEN, WordSim353, SimLex-999, and STS-B. Pure NTP post-training slightly degrades semantic understanding, while concept supervision improves alignment with human semantic judgments. For SimLex-999, concept training helps except at concept weight 1, suggesting token-level supervision is more important for this benchmark. Red dots indicate baseline models (pretrained and pure NTP post-trained models). A sketch of this similarity evaluation follows the figure list.
  • Figure 4: Content-word NTP perplexity and accuracy on a held-out set of 1,000 C4 and OpenWebText (OWT) samples. The in-domain plots show models trained and evaluated on the same dataset, while OOD means training on C4 and evaluating on OWT, or vice versa (training data in legend). Performance generally improves as concept supervision increases, with some models showing a small dip between concept weights 0.75 and 1.0. Red dots indicate baseline models (pretrained models and the pure NTP post-trained models). A sketch of the content-word perplexity computation also follows the figure list.
  • Figure 5: Clustering ability generally improves as concept weight increases, with degradation at concept weight 1. These results show that concept training helps models learn finer semantic distinctions, but some amount of NTP supervision is necessary to maintain general semantic understanding. Red dots indicate baseline models.
  • ...and 11 more figures
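As referenced in the Figure 3 caption, here is a minimal sketch of the standard recipe for the word-similarity evaluations (WordSim353 / SimLex-999 style): correlate model-derived cosine similarities with human ratings via Spearman's rho. The embed callable is a hypothetical helper, assumed to map a word to a 1-D vector derived from the model (e.g., an averaged hidden state); the benchmarks' exact preprocessing is not reproduced here.

import torch.nn.functional as F
from scipy.stats import spearmanr

def similarity_alignment(embed, word_pairs, human_scores):
    # Correlate model similarities with human judgments.
    # embed: assumed helper returning a 1-D torch tensor for a word.
    # word_pairs: list of (word_a, word_b) pairs from the benchmark.
    # human_scores: human similarity ratings, in the same order.
    model_scores = []
    for a, b in word_pairs:
        va, vb = embed(a), embed(b)
        # Cosine similarity between the two word representations.
        model_scores.append(F.cosine_similarity(va, vb, dim=0).item())
    # Spearman rank correlation between model and human scores.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho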
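Similarly, for the content-word perplexity reported in Figure 4, the sketch below averages next-token loss only over positions whose target token is a content word. Assumptions: a Hugging Face-style model whose forward pass returns .logits, and a precomputed boolean mask marking content-word targets (the paper defines these words in Section 3.1; the mask construction, e.g., via a POS tagger over the detokenized text, is not shown).

import math
import torch
import torch.nn.functional as F

def content_word_perplexity(model, input_ids, content_mask):
    # Perplexity restricted to content-word targets.
    # input_ids: (1, seq_len) token ids.
    # content_mask: (1, seq_len) bool, True where the token is a
    # content word (mask construction assumed, see lead-in).
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Shift so position t predicts token t + 1.
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:]
    shift_mask = content_mask[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    )
    # Keep only the losses at content-word positions, then exponentiate.
    nll = nll[shift_mask.reshape(-1)]
    return math.exp(nll.mean().item())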