
Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks

Zhenhailong Wang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen, Heng Ji

TL;DR

Zemi is a zero-shot semi-parametric language model that combines retrieval from a large task-agnostic corpus with a novel augmentation fusion module. It extends multitask prompted training to incorporate multiple retrieved augmentations through a perceiver resampler and gated cross-attention, achieving strong zero-shot generalization while remaining more compact than large fully-parametric models. Empirically, Zemi_LARGE outperforms T0-3B by 16% on all seven evaluation tasks while being 3.9x smaller in model size, highlighting the potential of retrieval-augmented multitask learning. The work also provides ablations and overhead analyses that offer insight into how architecture design and training choices help the model use salient augmentations while tolerating noisy ones.

Abstract

Although large language models have achieved impressive zero-shot ability, their huge model size generally incurs high cost. Recently, semi-parametric language models, which augment a smaller language model with an external retriever, have demonstrated promising language modeling capabilities. However, it remains unclear whether such semi-parametric language models can perform as competitively as their fully-parametric counterparts on zero-shot generalization to downstream tasks. In this work, we introduce Zemi, a zero-shot semi-parametric language model. To the best of our knowledge, this is the first semi-parametric language model that can demonstrate strong zero-shot performance on a wide range of held-out unseen tasks. We train Zemi with a novel semi-parametric multitask prompted training paradigm, which shows significant improvement compared with the parametric multitask training proposed by T0. Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus. In order to incorporate multiple potentially noisy retrieved augmentations, we further propose a novel augmentation fusion module leveraging a perceiver resampler and gated cross-attention. Notably, our proposed $\text{Zemi}_\text{LARGE}$ outperforms T0-3B by 16% on all seven evaluation tasks while being 3.9x smaller in model size.

Paper Structure (33 sections, 2 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Overview of the semi-parametric multitask prompted training. Each training and evaluation instance is formatted with unified text-to-text prompt templates [t0, promptsource]. In this work, we further augment the prompted instances with retrieved passages from a large-scale task-agnostic corpus, C4 [t0], which is the same unlabeled pretraining corpus used in T5 [t5] and T0 [t0]. An example of the prompted input and the retrieved documents can be found in Figure 2.
  • Figure 2: Zemi model architecture with an example of a prompted input and a generated output from the Piqa [piqa] task. The italic text in the prompted input $I$ indicates the prompt template. $A_1$ and $A_k$ show two examples of the corresponding retrieved augmentations (documents) from the C4 corpus. To incorporate the potentially noisy retrieved augmentations, we introduce a lightweight retrieval-augmentation fusion module that contains two major components: a single-layer perceiver resampler and a single-layer gated cross-attention (detailed on the right).
  • Figure 3: Example of good and noisy retrieved augmentations. See the retrieval examples appendix for more examples.
  • Figure 4: Example of retrieved documents on HellaSwag.
  • Figure 5: Example of retrieved documents on OpenbookQA.
  • ...and 6 more figures
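
The Figure 2 caption names two components of the fusion module: a single-layer perceiver resampler and a single-layer gated cross-attention. The following is a minimal sketch of how such a pair could fit together, not the authors' implementation. It assumes single-head attention without learned projections, a residual update in the resampler, and a tanh gate with a scalar `alpha`; all function names, dimensions, and random inputs are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def perceiver_resampler(latents, augmentation_states):
    """Compress any number of augmentation vectors into len(latents) vectors."""
    attended = cross_attention(latents, augmentation_states)
    return [[l + a for l, a in zip(lat, att)]
            for lat, att in zip(latents, attended)]


def gated_cross_attention(input_states, latents, alpha):
    """Residual cross-attention scaled by tanh(alpha); alpha = 0 closes the gate."""
    attended = cross_attention(input_states, latents)
    gate = math.tanh(alpha)
    return [[x + gate * a for x, a in zip(xs, att)]
            for xs, att in zip(input_states, attended)]


def fuse(input_states, augmentation_states, latents, alpha):
    """Resample the augmentations, then let the input attend to them through the gate."""
    compressed = perceiver_resampler(latents, augmentation_states)
    return gated_cross_attention(input_states, compressed, alpha)


# Hypothetical toy data: 3 input positions, 10 augmentation positions, 2 latents.
rng = random.Random(0)
d_model = 4
input_states = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
augmentation_states = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(10)]
latents = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(2)]

closed = fuse(input_states, augmentation_states, latents, alpha=0.0)
opened = fuse(input_states, augmentation_states, latents, alpha=1.0)
print(closed == input_states)       # whether a closed gate leaves the input untouched
print(len(opened), len(opened[0]))  # fused output keeps the input's shape
```

Under these assumptions, the resampler bounds the cost of the second attention step by the number of latents rather than by the total length of the retrieved documents, and the gate gives the model a way to scale down the contribution of noisy augmentations.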