Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering

Sungjoon Park, Varun Ramamurthi, Owen Terry

TL;DR

The paper investigates neologisms (new vocabulary tokens trained while the rest of the model's weights stay frozen) as a parameter-efficient way to steer large language models. It compares neologism learning against LoRA-based fine-tuning under matched data and hyperparameters, finding that neologisms often achieve comparable or superior concept adherence with far fewer trainable parameters. The study also examines self-verbalization behavior and the flexibility of steering via natural-language modifiers, highlighting practical advantages for modular, low-cost control. While acknowledging limitations in the training setups, the results support neologism-based steering as a promising alternative to traditional fine-tuning for targeted behavioral control in LLMs.

Abstract

In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending "Give me a neologism answer." to a prompt. Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism trains only d parameters (a single embedding vector of the model's hidden dimension d) and still lets the user access the model's default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.
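
To make the parameter-efficiency claim concrete, here is a minimal counting sketch. It assumes that d is the width of one token embedding and that LoRA adds two low-rank factors per adapted weight matrix. The model width, LoRA rank, and number of adapted matrices are hypothetical values chosen for illustration, not the paper's configuration.

```python
def neologism_trainable_params(d_model: int, num_new_tokens: int = 1) -> int:
    """One d_model-sized embedding per new token; all other weights stay frozen."""
    return d_model * num_new_tokens


def lora_trainable_params(d_in: int, d_out: int, rank: int,
                          num_adapted_matrices: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted weight matrix."""
    return rank * (d_in + d_out) * num_adapted_matrices


# Hypothetical configuration (assumption, not taken from the paper).
d_model = 4096
neo = neologism_trainable_params(d_model)
lora = lora_trainable_params(d_model, d_model, rank=8, num_adapted_matrices=64)

print(f"neologism: {neo} trainable parameters")
print(f"LoRA:      {lora} trainable parameters")
print(f"ratio:     {lora // neo}x")
```

Under these assumed settings the LoRA adapter trains roughly a thousand times more parameters than the single neologism embedding. The exact ratio depends on the rank and on which matrices are adapted.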

Paper Structure

This paper contains 24 sections, 5 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Word count statistics across inference runs for concept " short".
  • Figure 2: LLM-as-a-judge statistics across inference runs for concept " kidmode".
  • Figure 3: Mean capability scores across all inference runs.
  • Figure 4: