Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering
Sungjoon Park, Varun Ramamurthi, Owen Terry
TL;DR
The paper investigates using neologisms (new vocabulary tokens trained while the model's weights stay frozen) to steer large language models in a parameter-efficient manner. It compares neologism learning against LoRA-based fine-tuning under matched data and hyperparameters, finding that neologisms often achieve comparable or superior concept adherence with far fewer trainable parameters. The study further explores self-verbalization behaviors and the flexibility of steering via natural-language modifiers, highlighting practical advantages for modular, low-cost control. Although the authors acknowledge limitations in the training setup, the results support neologism-based steering as a promising alternative to traditional fine-tuning for targeted behavioral control in LLMs.
Abstract
In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending "Give me a neologism answer." to a prompt. Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism trains only d parameters (a single new token embedding, with d the embedding dimension) and leaves the model's default behavior accessible whenever the neologism is absent from the prompt. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms and observe that the model will occasionally make up its own new words when asked about a neologism.
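To make the parameter-count contrast concrete, the following is a rough back-of-the-envelope sketch rather than the configuration used in this paper. It assumes that d is the model's embedding dimension, that a neologism adds exactly one trainable embedding vector, and that LoRA adds two low-rank factors per adapted weight matrix. The specific values (d = 4096, rank 8, 32 layers, two adapted projections per layer) are hypothetical and chosen only for illustration.

```python
def neologism_trainable_params(d_model: int, num_neologisms: int = 1) -> int:
    """Trainable parameters when only new token embeddings are learned.

    Assumes one d_model-sized embedding vector per neologism and that
    every other model weight stays frozen.
    """
    return d_model * num_neologisms


def lora_trainable_params(d_in: int, d_out: int, rank: int,
                          num_adapted_matrices: int) -> int:
    """Trainable parameters for LoRA adapters.

    Each adapted d_out x d_in weight matrix gets two low-rank factors:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out) * num_adapted_matrices


# Hypothetical configuration, not taken from the paper.
d_model = 4096            # assumed embedding dimension
rank = 8                  # assumed LoRA rank
num_layers = 32           # assumed number of transformer layers
matrices_per_layer = 2    # assumed: two square projections adapted per layer

neo = neologism_trainable_params(d_model)
lora = lora_trainable_params(d_model, d_model, rank,
                             num_layers * matrices_per_layer)

print(f"neologism: {neo} trainable parameters")
print(f"LoRA:      {lora} trainable parameters")
print(f"ratio:     {lora // neo}x")
```

Under these assumed values a single neologism trains 4,096 parameters, while the LoRA adapters train 4,194,304, a factor of 1,024 more. The exact ratio depends on the LoRA rank and on which matrices are adapted.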
