Semiparametric Token-Sequence Co-Supervision

Hyunji Lee; Doyoung Kim; Jihoon Jun; Sejune Joo; Joel Jang; Kyoung-Woon On; Minjoon Seo

TL;DR

This work introduces semiparametric token-sequence co-supervision, a training method that supervises a language model simultaneously with the traditional next token prediction loss and a next sequence prediction loss computed over a nonparametric sequence embedding space.

Abstract

In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging supervision from the traditional next token prediction loss, which is calculated over the parametric token embedding space, and the next sequence prediction loss, which is calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked with condensing an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages broader generalization capability across the model. In particular, the robustness of the parametric token space, which is established during the pretraining step, tends to effectively enhance the stability of the nonparametric sequence embedding space, a new space established by another language model.
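To make the two supervision signals concrete, the following is a minimal sketch, not the authors' implementation. It assumes that both losses are softmax cross-entropies over dot-product scores, that the next sequence prediction loss is contrastive over the sequence embeddings available in a batch, and that the two losses are combined as a weighted sum with a hypothetical weight `lam`. The function names and the toy vectors are illustrative only.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ntp_loss(hidden, token_embeddings, target_token):
    """Next token prediction: cross-entropy over the parametric
    token embedding space (one embedding per vocabulary entry)."""
    logits = [dot(hidden, e) for e in token_embeddings]
    return -log_softmax(logits)[target_token]


def nsp_loss(hidden, sequence_embeddings, target_sequence):
    """Next sequence prediction: cross-entropy over sequence embeddings
    produced by a separate model. ASSUMPTION: in-batch contrastive form
    with dot-product scores; the paper's exact formulation may differ."""
    scores = [dot(hidden, e) for e in sequence_embeddings]
    return -log_softmax(scores)[target_sequence]


def co_supervision_loss(l_ntp, l_nsp, lam=1.0):
    """ASSUMPTION: weighted sum of the two losses; `lam` is hypothetical."""
    return l_ntp + lam * l_nsp


# Toy example: hidden size 2, vocabulary of 3 tokens, 2 candidate sequences.
hidden = [1.0, 0.0]
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sequence_embeddings = [[2.0, 0.0], [0.0, 2.0]]

l_ntp = ntp_loss(hidden, token_embeddings, target_token=0)
l_nsp = nsp_loss(hidden, sequence_embeddings, target_sequence=0)
total = co_supervision_loss(l_ntp, l_nsp, lam=0.5)
```

In this sketch the same hidden state is scored against both spaces, which is the sense in which one model receives both supervisions at once; in practice the token-level and sequence-level targets would sit at different positions of the training sequence.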

Paper Structure (54 sections, 8 equations, 7 figures, 14 tables)

Introduction
Related Works
Aligning two different models
Language Models with Nonparametric Embeddings
Semiparametric Token-Sequence Co-Supervision
Revisiting Next Token Prediction
Next Sequence Prediction
Co-Supervision
Implementation Details
Problem Setup
Training
Inference
Experiments Setup
Baseline
Metric
...and 39 more sections

Figures (7)

Figure 1: While previous methods train language models with next token prediction loss (NTP), semiparametric token-sequence co-supervision trains a language model in a multi-task manner where supervision from parametric token embedding space (NTP) and supervision from nonparametric sequence embedding space (NSP) flow simultaneously.
Figure 2: Overview of semiparametric token-sequence co-supervision. Gen is an autoregressive LM with an LM head on top, which is trained with co-supervision over the parametric token embedding space ($L_{\text{NTP}}$) and the nonparametric sequence embedding space ($L_{\text{NSP}}$). $\texttt{Emb}_{seq}$, another autoregressive LM, constructs the nonparametric sequence embedding space from its output embeddings when given a sequence as input. $t_i$ indicates tokens, $h$ indicates the dimension of the hidden state, and $M$ indicates the number of sequences in a batch (refer to Appendix \ref{app: train} for the detailed calculation).
Figure 3: Reduction rate in correctness when instances answered correctly through parametric knowledge are counted as wrong.
Figure 4: Effect of the choice of $\texttt{Emb}_{seq}$, which constructs the nonparametric sequence embedding space, on overall performance when training with NTP + NSP. We experiment with three different models: GPT2-large, TinyLlama, and Llama2-7B.
Figure 5: Average performance on each metric over 8 datasets in KILT when varying the weight parameter $\lambda$ of NTP + NSP.
...and 2 more figures
