Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models

John Cooper, Ilias Diakonikolas, Mingchen Ma, Frederic Sala

TL;DR

This paper empirically shows that learned--rather than constructed--hybrids outperform non-hybrid models with up to 6x as many parameters, and demonstrates that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.

Abstract

Hybrid sequence models--combining Transformer and state-space model layers--seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where--and underlying mechanisms through which--they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family--namely selective copying and associative recall--we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned--rather than constructed--hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.

Paper Structure

This paper contains 26 sections, 14 theorems, 41 equations, 16 figures, 2 tables.

Key Result

Theorem 3.3

Let $F$ be a function defined as in Definition 3.1 (Function Composition) that satisfies the SSM lower-bound assumption. There is a distribution $D$ over the input $(u,v)$ such that any model $M$ that is a composition of $k$ state-space layers $\mathrm{SSM}_i$ with state spaces $\mathcal{S}_i$, $i \in [k]$, that can compute $F$ with p

Figures (16)

  • Figure 1: Example function composition task. The answer to a learned question only depends on a part of the long context input.
  • Figure 2: The constructions follow a common pattern: an SSM takes an input $x$ and implements two functions $u, v$. Typically, $u$ is a truncation of the input and $v$ is a control parameter (shown in purple). A Transformer then combines them by implementing $F$ to perform the complete task (shown in red).
  • Figure 3: The construction solving selective copy takes an input sequence and finds the most recent number token (shown in the bottom squares of the SSM output). The Transformer then uses this information to look back some relative distance and find the correct token to output.
  • Figure 4: Results from training small models on Selective Copy as the hidden dimension of the models increases. At 2000 parameters, hybrid models consistently attain perfect accuracy, while the pure models, with 6x the parameters, only attain around 0.9 accuracy.
  • Figure 5: Results from training small models on Associative Recall with Decoding. Even at much smaller scales than the pure models, the hybrid is the only architecture that attains 0.5 accuracy. At the scales tested, none of the pure models performed the task with more than 0.4 accuracy.
  • ...and 11 more figures
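
The two-stage pattern described in the captions for Figures 2 and 3 can be illustrated with a toy sketch. This is not the paper's construction: the token encoding, the function names (`recurrent_stage`, `lookup_stage`, `toy_hybrid`), and the rule "the most recent number token gives the look-back distance from the final position" are all assumptions made for illustration, since the formal task (Definition 4.1) is not reproduced here. The first stage keeps a constant-size running state, as a state-space layer would; the second stage does a single position-based lookup, standing in for what attention would do.

```python
from typing import List, Optional, Union

# Toy encoding (an assumption): str = content token, int = "number" token.
Token = Union[str, int]


def recurrent_stage(tokens: List[Token]) -> List[Optional[int]]:
    """SSM-like scan: the state is just the most recent number token seen."""
    state: Optional[int] = None
    outputs: List[Optional[int]] = []
    for tok in tokens:
        if isinstance(tok, int):
            state = tok  # overwrite the constant-size state
        outputs.append(state)
    return outputs


def lookup_stage(tokens: List[Token],
                 last_number: List[Optional[int]]) -> Optional[Token]:
    """Attention-like step: hard lookup at a relative offset from the end."""
    if not tokens:
        return None
    distance = last_number[-1]
    if distance is None:
        return None  # no number token was ever seen
    target = len(tokens) - 1 - distance
    if target < 0:
        return None  # the offset points before the start of the sequence
    return tokens[target]


def toy_hybrid(tokens: List[Token]) -> Optional[Token]:
    """Compose the two stages: scan first, then one relative lookup."""
    return lookup_stage(tokens, recurrent_stage(tokens))


if __name__ == "__main__":
    # The most recent number token is 3; three positions before the end is "c".
    print(toy_hybrid(["a", "b", "c", 3, "d", "e"]))
```

The sketch is only meant to show the division of labor: the scan never needs to store the whole sequence, and the lookup never needs to track which number token was most recent.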

Theorems & Definitions (34)

  • Definition 3.1: Function Composition
  • Theorem 3.3
  • Remark 3.4
  • Lemma 3.5
  • Theorem 3.7
  • Definition 4.1: Selective Copying
  • Theorem 4.2
  • Theorem 4.3
  • Definition 4.4: Associative Recall with Decoding
  • Theorem 4.5
  • ...and 24 more