Improving Length-Generalization in Transformers via Task Hinting

Pranjal Awasthi, Anupam Gupta

TL;DR

Transformers struggle with length generalization on multi-step reasoning tasks. The authors propose task hinting via multitask learning, pairing a main sorting task with an auxiliary successor task to induce inductive biases that improve extrapolation to longer sequences. They provide a theoretical construction for a shallow transformer that embodies copy, min, and Identity+Successor primitives and introduce length-dependent tempered softmax, both of which enhance generalization; empirical results show dramatic gains for sorting (up to 92.6% on length $100$ with repetitions) and meaningful improvements on an increment task. The work demonstrates a general training-time technique to bolster out-of-distribution robustness of transformers, with potential applicability beyond sorting and to broader reasoning tasks.

Abstract

It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on a task (say, addition) up to a certain length (e.g., 5-digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinting to address length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train the model to solve a simpler but related auxiliary task. We study the classical sorting problem as a canonical example to evaluate our approach. We design a multitask training framework and show that task hinting significantly improves length generalization. For sorting, we show that it is possible to train models on sequences of length at most $20$ and improve the test accuracy on sequences of length $100$ from less than 1% (for standard training) to more than 92% (via task hinting). Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms by which the model performs the task, and propose a theoretical construction consistent with the observed learning behavior of the model. Based on our construction, we show that introducing a small number of length-dependent parameters into the training procedure can further boost performance on unseen lengths. Finally, we show the efficacy of our task-hinting-based approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
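
To make the idea of length-dependent parameters concrete, the following is a minimal sketch of a length-tempered softmax. The exact parameterization used in the paper is not stated in this summary, so the form below is an assumption: the logits are multiplied by `theta * log(n)`, where `n` is the sequence length and `theta` is a hypothetical scalar. The function names are illustrative only.

```python
import math


def softmax(logits, scale=1.0):
    """Numerically stable softmax over a list of floats, with an optional scale."""
    m = max(logits)
    exps = [math.exp(scale * (x - m)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def length_tempered_softmax(logits, theta=1.0):
    """Softmax whose sharpness grows with sequence length.

    ASSUMPTION: the scale theta * log(n) is one plausible length-dependent
    form, chosen for illustration; it is not taken from the paper.
    """
    n = len(logits)
    return softmax(logits, scale=theta * math.log(max(n, 2)))


def peak_weight(n, tempered, gap=1.0, theta=2.0):
    """Attention weight on one 'target' position whose logit exceeds
    the other n - 1 positions by `gap`."""
    logits = [gap] + [0.0] * (n - 1)
    if tempered:
        return length_tempered_softmax(logits, theta)[0]
    return softmax(logits)[0]


if __name__ == "__main__":
    for n in (20, 100):
        print(n, round(peak_weight(n, False), 3), round(peak_weight(n, True), 3))
```

With a fixed logit gap, the plain softmax spreads its mass over more positions as `n` grows, so the weight on the target position shrinks with length. Under the assumed log-length scaling, that weight stays close to 1 at both length 20 and length 100, which is the kind of behavior a length-dependent parameter is meant to preserve.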

Paper Structure

This paper contains 23 sections, 1 theorem, 20 equations, 16 figures, 4 tables.

Key Result

Theorem 5.1

For any alphabet of size $q$ and bit precision complexity $b$, there exists a depth-2 decoder only transformer model with two attention heads, embedding dimensionality and hidden layer dimensionality of $O(q)$, and network weights encoded using $b$ bits of precision that correctly solves the sorting

Figures (16)

  • Figure 3.1: An example input sequence for decoder only model training. The mask ensures that we only penalize the model for predictions at the output positions.
  • Figure 3.2: Effect of data scaling on length generalization. While performance improves on length $50$ sequences, there is no benefit at higher lengths.
  • Figure 3.3: Effect of model scaling on length generalization. All the models have less than $1\%$ test accuracy for length $100$ sequences.
  • Figure 3.4: An example input sequence for the successor task.
  • Figure 3.5: Effect of data scaling for task hinting. We observe consistent improvements in test accuracy on higher length sequences.
  • ...and 11 more figures
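
The captions for Figures 3.1 and 3.4 describe masked training sequences for the sorting task and the successor task. The sketch below shows one way such examples could be built. The token layout, the `<sep>` and `<end>` markers, and the definition of the successor for repeated or maximal elements are assumptions made for illustration, not the paper's exact format.

```python
SEP = "<sep>"  # hypothetical separator between input and output
END = "<end>"  # hypothetical end-of-sequence marker


def make_sort_example(seq):
    """Return (tokens, loss_mask) for the main sorting task.

    The mask is 1 only at output positions, so a decoder-only model
    is penalized solely for its predictions of the sorted output.
    """
    out = sorted(seq)
    tokens = list(seq) + [SEP] + out + [END]
    mask = [0] * (len(seq) + 1) + [1] * (len(out) + 1)
    return tokens, mask


def make_successor_example(seq, query):
    """Return (tokens, loss_mask) for the auxiliary successor task.

    ASSUMPTION: the successor of `query` is the element that follows its
    first occurrence in sorted order, so a repeated element is its own
    successor, and the maximum element maps to END.
    """
    ordered = sorted(seq)
    idx = ordered.index(query)  # first occurrence in sorted order
    answer = ordered[idx + 1] if idx + 1 < len(ordered) else END
    tokens = list(seq) + [SEP, query, answer]
    mask = [0] * (len(seq) + 2) + [1]  # loss only on the answer token
    return tokens, mask


if __name__ == "__main__":
    print(make_sort_example([3, 1, 2]))
    print(make_successor_example([3, 1, 2, 1], 2))
```

In a multitask setup, examples from both functions would be mixed in each training batch. The mixing ratio is a free choice that this summary does not specify.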

Theorems & Definitions (2)

  • Theorem 5.1
  • Proof of Theorem 5.1