Language Models Need Inductive Biases to Count Inductively

Yingshan Chang, Yonatan Bisk

TL;DR

Traditional RNNs trivially achieve inductive counting, whereas Transformers must rely on positional embeddings to count out-of-domain, and modern RNNs largely underperform traditional RNNs at generalizing counting inductively.

Abstract

Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature on children learning to count. In both cases, the argument holds that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, length generalization is investigated throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. In all cases, counting is central to task success, and, crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures spanning RNNs, Transformers, State-Space Models, and RWKV. We present carefully designed task formats, auxiliary tasks, and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose the merits of a recurrent nature.

Paper Structure (35 sections, 7 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Illustration of input-output formats. Every integer, as well as each of $a$, $b$, $a_1, ..., a_{10}$, is an individual token. Row 1: Vanilla counting, where each token outputs the count of $a$'s seen from the beginning of the sequence up to itself. Row 2: Vanilla counting augmented with input-output pairs that inform the order of number tokens. Rows 3-4: $b$ is the helper token, to be seen with larger counts. $a$ is the main token of interest, to be seen with restricted counts during training and tested with OOD counts. Rows 7-8: $a_1, ..., a_{10}$ are distinct tokens. Each of them should maintain its own counter.
  • Figure A1: Input-output format of the first_tok_homogeneous task, designed for the purpose of examining which types of PE can be leveraged by Transformers to distinguish among identical tokens.
  • Figure A2: PCA of intermediate states for 2L APE/SinePE/SPE Transformers trained on modular counting. This differentiates SPE from APE/SinePE in the ability to count modularly, although all three PEs accomplish first-token recognition.
  • Figure A3: Input-output format of the first_tok_heterogeneous task, designed to demonstrate that a causal Transformer can recognize the first token in a heterogeneous input sequence without the need for PEs.
  • Figure A4: Variation in attention scores influenced by PEs. First, we take the best-performing 1L checkpoints trained on the Selective Counting + BOS task and compute attention scores (pre-softmax $qk^T$) between all pairs of tokens. Then we compute the std of attention scores across all possible assignments of PIDs. Hence, the value of each cell in each of the panels visualized here equals the std of $q_{\mathrm{source}}k_{\mathrm{target}}^T$ values across all PID pairs that the source and target tokens can take (i.e. $\mathrm{PID_{source}} \geq \mathrm{PID_{target}}$ following the causal rule).
  • ...and 1 more figure
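
The Figure 1 caption describes two target formats that can be sketched in a few lines: vanilla counting (Row 1), where each position outputs the running count of $a$'s, and the multi-token variant (Rows 7-8), where each distinct token keeps its own counter. The following is a minimal sketch based only on the caption; the function names, the use of plain Python lists, and the string token names are illustrative assumptions, not the authors' data pipeline.

```python
from collections import defaultdict


def vanilla_counting_targets(tokens, target="a"):
    """Running count of `target` seen from the start of the sequence up to each position.

    Hypothetical helper: mirrors the Row 1 description in the Figure 1 caption.
    """
    count = 0
    targets = []
    for tok in tokens:
        if tok == target:
            count += 1
        targets.append(count)
    return targets


def per_token_counting_targets(tokens):
    """Each distinct token keeps its own counter (Rows 7-8 description).

    Hypothetical helper: the output at each position is how many times
    that position's token has occurred so far, including itself.
    """
    counters = defaultdict(int)
    targets = []
    for tok in tokens:
        counters[tok] += 1
        targets.append(counters[tok])
    return targets


if __name__ == "__main__":
    print(vanilla_counting_targets(["a", "a", "a", "a"]))
    print(per_token_counting_targets(["a1", "a2", "a1", "a3", "a2", "a1"]))
```

Under this sketch, a "train short, test long" split would simply restrict training sequences to small counts and evaluate on sequences whose counts exceed anything seen in training.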