Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations

Ziqiao Ma; Zekun Wang; Joyce Chai

TL;DR

This work introduces Trial-and-Demonstration (TnD), an interactive learning framework where a student language model receives corrective feedback through trials and demonstrations from a teacher model, guided by an age-conditioned reward that tracks developmental progress. Using PPO-style updates with modified policy learning (including demonstration-based updates and removal of KL penalties), the study shows that TnD accelerates word acquisition for models of comparable or smaller size and that teacher word choices and trial frequency shape learning trajectories. The authors develop a neural age predictor to derive the reward from the student’s generated text, enabling scalable, human-free interactive learning on large corpora. The results demonstrate a promising interactive alternative to purely non-interactive pretraining, with implications for improving learning efficiency and knowledge distillation in smaller models, while also highlighting limitations and directions for future work in reward design and cross-linguistic applicability.

Abstract

Humans are efficient language learners and inherently social creatures. Our language development is largely shaped by our social interactions, for example, the demonstration and feedback from caregivers. Contrary to human language learning, recent advancements in large language models have primarily adopted a non-interactive training paradigm, and refined pre-trained models through feedback afterward. In this work, we explore how corrective feedback from interactions influences neural language acquisition from scratch through systematically controlled experiments, assessing whether it contributes to word learning efficiency in language models. We introduce a trial-and-demonstration (TnD) learning framework that incorporates three distinct components: student trials, teacher demonstrations, and a reward conditioned on language competence at various developmental stages. Our experiments reveal that the TnD approach accelerates word acquisition for student models of equal and smaller numbers of parameters, and we highlight the significance of both trials and demonstrations. We further show that the teacher's choices of words influence students' word-specific learning efficiency, and a practice-makes-perfect effect is evident by a strong correlation between the frequency of words in trials and their respective learning curves. Our findings suggest that interactive language learning, with teacher demonstrations and active trials, can facilitate efficient word learning in language models.
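The alternation between interactive and non-interactive steps described above can be sketched as a small scheduling loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `run_tnd_schedule`, the `interactive_every` ratio, the placeholder prompt, and all callables passed in are hypothetical, and the actual reward (an age-conditioned function derived from a neural age predictor) and the PPO-style update are left as opaque callbacks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepLog:
    step: int
    kind: str  # "interactive" or "non_interactive"
    trial_reward: float = 0.0
    demo_reward: float = 0.0


def run_tnd_schedule(
    num_steps: int,
    interactive_every: int,  # assumed alternation ratio, not from the paper
    student_trial: Callable[[str], str],
    teacher_demo: Callable[[str], str],
    reward_fn: Callable[[str, int], float],  # (text, student step) -> reward
    rl_update: Callable[[List[Tuple[str, float]]], None],
    clm_update: Callable[[int], None],
) -> List[StepLog]:
    """Alternate interactive (trial + demonstration) and non-interactive steps."""
    logs: List[StepLog] = []
    for step in range(1, num_steps + 1):
        if step % interactive_every == 0:
            prompt = f"prompt-{step}"  # hypothetical placeholder prompt
            trial = student_trial(prompt)  # the student attempts a completion
            demo = teacher_demo(prompt)    # the teacher demonstrates one
            r_trial = reward_fn(trial, step)
            r_demo = reward_fn(demo, step)
            # Both the trial and the demonstration feed the policy update.
            rl_update([(trial, r_trial), (demo, r_demo)])
            logs.append(StepLog(step, "interactive", r_trial, r_demo))
        else:
            # Ordinary causal language modeling step on the corpus.
            clm_update(step)
            logs.append(StepLog(step, "non_interactive"))
    return logs
```

Passing the models and updates as callables keeps the sketch independent of any particular training library; in a real setup they would wrap the student, the teacher, the reward model, and the optimizer.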

Paper Structure (65 sections, 7 equations, 14 figures, 5 tables, 1 algorithm)


Introduction
Interactive Language Learning by Trials-and-Demonstrations (TnD)
The student model and trials
The teacher model and demonstrations
The reward and reward model
Alternating interactive and non-interactive language learning
Interactive language learning setup.
Demonstration in policy update.
Removal of KL-divergence objective.
Non-interactive language learning setup.
Alternating interactive and non-interactive learning.
Experiment and Evaluations
Experiment setup
Training corpora.
Baselines and ablation variants.
...and 50 more sections

Figures (14)

Figure 1: The learning by trial-and-demonstration (TnD) framework. In stage 1, we start by training a language model with the causal language modeling objective. In stage 2, we prompt the models along the learning trajectory for (text, step) pairs and train a neural age predictor to predict the training step given a text. In stage 3, we use the final model in stage 1 as the teacher model. In an interactive step, the student model is prompted to complete a trial, and the teacher model is prompted to provide a demonstration. The trials and demonstrations are scored by an age-conditioned reward function (Eq. \ref{eq:reward}), and the student model updates the policy with reinforcement learning. The student alternates between interactive and non-interactive steps.
Figure 2: We sample reward model predictions at different steps and compare them to the logarithm of the ground-truth step. The reward models are satisfactory, as the model-predicted age/step largely overlaps with the true age/step.
Figure 3: On 2 training corpora and 2 test vocabularies, we aggregate 5 random seeds and present the fitted learning curves of mean surprisal over $\log_{10}$ training steps, with nAoA@0.5 of each curve indicated by a vertical dashed line.
Figure 4: On 2 training corpora and 2 test vocabularies, we aggregate 5 random seeds and present the neural age of acquisition (nAoA) at different surprisal thresholds from 0.5 to 0.95 with a step of 0.05.
Figure 5: On 2 training corpora and 2 test vocabularies, we aggregate 5 random seeds and evaluate the effective vocabulary size over $\log_{10}$ training steps. The dashed lines mark the tested vocabulary size.
...and 9 more figures
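
The nAoA@threshold values mentioned in the captions for Figures 3 and 4 can be illustrated with a threshold-crossing computation on a sampled learning curve. This is a rough sketch under assumptions, not the paper's definition: it assumes the cutoff lies a fraction `threshold` of the way from the curve's maximum surprisal down to its minimum, and it uses linear interpolation between sampled points. The function name `naoa_at_threshold` and its parameters are hypothetical.

```python
from typing import Sequence


def naoa_at_threshold(
    log_steps: Sequence[float],   # log10 training steps, increasing
    surprisal: Sequence[float],   # fitted mean surprisal at each step
    threshold: float,             # fraction in (0, 1], e.g. 0.5
) -> float:
    """Return the log10 step where surprisal first falls to the cutoff."""
    if len(log_steps) != len(surprisal) or not surprisal:
        raise ValueError("inputs must be non-empty and of equal length")
    high, low = max(surprisal), min(surprisal)
    # Assumed convention: cutoff moves from the maximum toward the minimum.
    cutoff = high - threshold * (high - low)
    for i, value in enumerate(surprisal):
        if value <= cutoff:
            if i == 0:
                return log_steps[0]
            # Linear interpolation between the two bracketing samples.
            x0, x1 = log_steps[i - 1], log_steps[i]
            y0, y1 = surprisal[i - 1], value
            return x0 + (y0 - cutoff) * (x1 - x0) / (y0 - y1)
    # Unreachable for threshold in (0, 1]: the minimum always meets the cutoff.
    return log_steps[-1]
```

Under this convention a lower nAoA means the curve reaches the cutoff earlier in training, which is how a faster learner would show up when comparing curves.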
