LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Szilvia Ujváry; Louis Béthune; Pierre Ablin; João Monteiro; Marco Cuturi; Michael Kirchhof

TL;DR

Small language models struggle to store exact factual knowledge within limited parameters, motivating selective token learning and the delegation of hard factual tokens to larger models. LaCy combines a loss signal with spaCy-based acceptability to decide which tokens to learn and which to delegate, enabling a cascade with a bigger model when needed. Across 334M and 1.3B parameter SLMs, LaCy achieves higher FactScore and lower factual leakage than loss-based, LLM-judge, and Rho baselines, while maintaining simplicity and low overhead. The study demonstrates that loss alone is insufficient for factual accuracy and that targeted delegation can provide scalable, cost-effective improvements for factual generation in SLMs.

Abstract

Language models have steadily grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter count. The capacity of Small Language Models (SLMs) is especially limited, which leads to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. In this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger a <CALL> even if their loss is high. We find that a spaCy grammar parser can augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors, and which are safe to learn and predict even under high loss. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate. This results in higher FactScores when generating in a cascade with a bigger model, outperforming Rho- or LLM-judge-trained SLMs while being simpler and cheaper.
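The selection rule and the cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `loss_threshold`, and the hand-set `is_fact` flags (standing in for what a spaCy parse might provide) are all assumptions made for illustration, as is the choice to let the larger model supply exactly one token per <CALL>.

```python
CALL = "<CALL>"

def make_targets(tokens, losses, is_fact, loss_threshold):
    """Build training targets: hard factual tokens become <CALL>, the rest stay.

    `is_fact` is a hypothetical per-token flag; in practice it would come
    from a grammar parser. `loss_threshold` is an assumed hyperparameter.
    """
    targets = []
    for tok, loss, fact in zip(tokens, losses, is_fact):
        # Delegate only when the token is both factual and high-loss.
        # High-loss non-factual tokens are kept and learned as usual.
        targets.append(CALL if (fact and loss > loss_threshold) else tok)
    return targets

def cascade_generate(slm_next, big_next, prompt, max_new_tokens):
    """Generate with the small model; on <CALL>, ask the larger model instead.

    `slm_next` and `big_next` are hypothetical callables mapping the current
    token list to the next token.
    """
    out = list(prompt)
    for _ in range(max_new_tokens):
        tok = slm_next(out)
        if tok == CALL:
            tok = big_next(out)  # the larger model fills in this position
        out.append(tok)
    return out

# Toy training example: only the hard factual token (the year) is delegated.
tokens = ["Marie", "Curie", "was", "born", "in", "1867"]
losses = [0.4, 0.3, 2.9, 0.5, 0.2, 6.1]
is_fact = [True, True, False, False, False, True]
targets = make_targets(tokens, losses, is_fact, loss_threshold=2.0)

# Toy inference example with scripted stand-ins for the two models.
script = ["was", "born", "in", CALL]

def slm_next(ctx):
    return script[len(ctx) - 2]

def big_next(ctx):
    return "1867"

generated = cascade_generate(slm_next, big_next, ["Marie", "Curie"], 4)
```

Note that "was" has a loss above the threshold but is kept as a regular target, which mirrors the abstract's point that a high loss alone should not trigger a <CALL>.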

Paper Structure (61 sections, 8 equations, 24 figures, 6 tables)

Introduction
Background
Why not learn all tokens?
Which tokens are learnable?
What to do once an SLM calls for help?
Loss alone cannot identify factual errors
Defining Acceptability.
Examples.
Experiment.
LaCy: don't learn what you can't
Experiments
Experimental Setup
Data.
Pretraining.
Inference and Cascading
...and 46 more sections

Figures (24)

Figure 1: Overview of the LaCy framework. We decide which tokens an SLM can and should learn during pretraining based on its loss signal and a spaCy grammar parser. If a token is a fact token that is too hard for the small model, we train the model to output a <CALL> token instead. At inference time, this triggers a larger model to step in. This enables the SLM to learn what it can predict, mitigating factual errors.
Figure 2: Results overview for pretraining a 334M SLM. (Left.) The LaCy-trained SLM achieves the highest FactScore when generating biographies with Llama 3.2 1B as cascade partner, confirming that it successfully generates calls at factual token positions. (Right.) Without calling, LaCy has the lowest fact leakage, meaning that the fewest facts were trained into the limited parametric memory of the SLM.
Figure 3: The difference between accuracy and acceptability. The token loss is predictive of whether a token is likely to match its exact ground-truth token (left). However, this signal is blind to the type of token: non-factual tokens are considered equally wrong as factual tokens, although non-factual tokens with high loss often do not render an output false (right). We use a spaCy grammar parser during pretraining to tell these two signals apart.
Figure 4: Generations from 334-million-parameter models. The task is biography generation; the prompt is given in italics. Tokens retrieved via <CALL> from Llama 3.2 1B are highlighted in gray. Factual statements are colored green if true and red if false, as scored by FactScore (Min et al., 2023). LaCy and the LLM judge successfully delegate factual tokens, acquiring information on nationality, profession, and dates. Rho-1 retrieves many useless tokens and has to rely on its own factual knowledge.
Figure 5: Comparison of validation losses: LaCy most clearly separates the tokens it learns from the tokens it does not learn. (Left.) Call losses. (Right.) Non-call losses. For each <CALL>-augmented method, we construct its call mask by selecting the top 15% of call logits in a batch. Full colors show the loss values of the <CALL>-augmented methods, while light colors show the loss of a vanilla baseline evaluated on the same <CALL> mask. LaCy calls on high-loss tokens (the baseline call loss is high) and learns even less about them, reaching a call loss of 5.72. Its non-call loss is competitive with that of the factuality-based LLM judge.
...and 19 more figures
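The call-mask construction mentioned in the Figure 5 caption (top 15% of call logits in a batch, then separate call and non-call losses) could look roughly like the following. This is a sketch under assumptions: the batch is treated as a flat list of positions, ties are broken by position, and the helper names `top_fraction_mask` and `split_mean_loss` are hypothetical rather than taken from the paper.

```python
def top_fraction_mask(call_logits, fraction=0.15):
    """Mark the `fraction` of positions with the highest <CALL> logits."""
    k = max(1, round(fraction * len(call_logits)))
    # Indices sorted by descending logit; the sort is stable, so ties
    # keep their original order.
    order = sorted(range(len(call_logits)),
                   key=lambda i: call_logits[i], reverse=True)
    mask = [False] * len(call_logits)
    for i in order[:k]:
        mask[i] = True
    return mask

def _mean(values):
    return sum(values) / len(values) if values else float("nan")

def split_mean_loss(losses, mask):
    """Return (mean loss on call positions, mean loss on non-call positions)."""
    call = [loss for loss, m in zip(losses, mask) if m]
    non_call = [loss for loss, m in zip(losses, mask) if not m]
    return _mean(call), _mean(non_call)

# Toy batch of 20 positions: the 3 highest call logits form the mask.
call_logits = [float(i) for i in range(20)]
losses = [1.0] * 17 + [6.0, 5.0, 7.0]
mask = top_fraction_mask(call_logits)
call_loss, non_call_loss = split_mean_loss(losses, mask)
```

Passing a vanilla baseline's per-token losses to `split_mean_loss` with the same `mask` would give the kind of comparison the caption describes as the light-colored bars.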
