Semiparametric Token-Sequence Co-Supervision

Hyunji Lee, Doyoung Kim, Jihoon Jun, Sejune Joo, Joel Jang, Kyoung-Woon On, Minjoon Seo

TL;DR

This work introduces semiparametric token-sequence co-supervision, a training method that supervises a language model simultaneously with the traditional next token prediction loss and a next sequence prediction loss computed over a nonparametric sequence embedding space.

Abstract

In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging supervision from the traditional next token prediction loss, which is calculated over the parametric token embedding space, and the next sequence prediction loss, which is calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked with condensing an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages broader generalization across the model. In particular, the robustness of the parametric token space, which is established during the pretraining step, tends to enhance the stability of the nonparametric sequence embedding space, a new space established by another language model.

Paper Structure

This paper contains 54 sections, 8 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: While previous methods train language models with next token prediction loss (NTP), semiparametric token-sequence co-supervision trains a language model in a multi-task manner where supervision from parametric token embedding space (NTP) and supervision from nonparametric sequence embedding space (NSP) flow simultaneously.
  • Figure 2: Overview of semiparametric token-sequence co-supervision. Gen is an autoregressive LM with an LM head on top, trained with co-supervision over the parametric token embedding space ($L_{\text{NTP}}$) and the nonparametric sequence embedding space ($L_{\text{NSP}}$). $\texttt{Emb}_{seq}$, another autoregressive LM, constructs the nonparametric sequence embedding space from its output embeddings when given a sequence as input. $t_i$ denotes tokens, $h$ the dimension of the hidden state, and $M$ the number of sequences in a batch (see Appendix \ref{app: train} for the detailed calculation).
  • Figure 3: Reduction rate in correctness when answers that are correct through parametric knowledge are counted as wrong.
  • Figure 4: Effect of the choice of $\texttt{Emb}_{seq}$, which constructs the nonparametric sequence embedding space, on overall performance when training with NTP + NSP. We experiment with three models: GPT2-large, TinyLlama, and Llama2-7B.
  • Figure 5: Average performance on each metric over 8 datasets in KILT when varying the weight parameter $\lambda$ of NTP + NSP.
  • ...and 2 more figures
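
The co-supervision described in the abstract and in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The following are assumptions made for illustration only: the function names (`ntp_loss`, `nsp_loss`, `co_supervision_loss`) are hypothetical, the NSP score is taken to be an inner product between a hidden state from Gen and each of the $M$ sequence embeddings in the batch, both losses are plain softmax cross-entropies, and $\lambda$ is assumed to scale the NSP term in a simple weighted sum. The paper's appendix gives the actual calculation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def ntp_loss(token_logits, target_ids):
    """Mean cross-entropy over vocabulary logits (parametric token space).

    token_logits: one list of vocabulary scores per position.
    target_ids:   index of the gold next token per position.
    """
    losses = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def nsp_loss(query_states, seq_embeddings, target_seq_ids):
    """Mean cross-entropy over the M in-batch sequence embeddings
    (nonparametric sequence space).

    Assumption: each query hidden state (size h) is scored against every
    sequence embedding (size h) by inner product.
    """
    losses = []
    for q, t in zip(query_states, target_seq_ids):
        scores = [sum(a * b for a, b in zip(q, e)) for e in seq_embeddings]
        losses.append(-log_softmax(scores)[t])
    return sum(losses) / len(losses)


def co_supervision_loss(l_ntp, l_nsp, lam=1.0):
    """Assumed combination: L = L_NTP + lam * L_NSP."""
    return l_ntp + lam * l_nsp


# Toy batch: one token position over a 4-word vocabulary, and one query
# scored against M = 2 sequence embeddings with h = 2.
l_ntp = ntp_loss([[0.0, 0.0, 0.0, 0.0]], [2])
l_nsp = nsp_loss([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [0])
total = co_supervision_loss(l_ntp, l_nsp, lam=0.5)
```

In this toy batch the NTP term equals $\log 4$ because the logits are uniform, and the NSP term is small because the query aligns with the gold sequence embedding. In a real setup the gradients of both terms would flow into the same Gen model, which is the sense in which the two supervisions are applied simultaneously.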