Improving Length-Generalization in Transformers via Task Hinting
Pranjal Awasthi, Anupam Gupta
TL;DR
Transformers struggle with length generalization on multi-step reasoning tasks. The authors propose task hinting via multitask learning, pairing a main sorting task with an auxiliary successor task to induce inductive biases that improve extrapolation to longer sequences. They provide a theoretical construction for a shallow transformer that embodies copy, min, and Identity+Successor primitives and introduce length-dependent tempered softmax, both of which enhance generalization; empirical results show dramatic gains for sorting (up to 92.6% on length $100$ with repetitions) and meaningful improvements on an increment task. The work demonstrates a general training-time technique to bolster out-of-distribution robustness of transformers, with potential applicability beyond sorting and to broader reasoning tasks.
Abstract
It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on a task (say, addition) up to a certain length (e.g., 5-digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinting to address length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train the model to solve a simpler but related auxiliary task. We study the classical sorting problem as a canonical example to evaluate our approach. We design a multitask training framework and show that task hinting significantly improves length generalization. For sorting, we show that it is possible to train models on sequences of length at most $20$ and improve the test accuracy on sequences of length $100$ from less than 1% (for standard training) to more than 92% (via task hinting). Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms by which the model performs the task, and propose a theoretical construction consistent with the observed learning behavior of the model. Based on our construction, we show that introducing a small number of length-dependent parameters into the training procedure can further boost performance on unseen lengths. Finally, we show the efficacy of our task-hinting approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
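To make the multitask setup concrete, the following is a minimal sketch of how training batches might mix the main sorting task with an auxiliary successor task. It is illustrative only and not the authors' data pipeline: the function names, the dictionary layout, the `aux_fraction` mixing ratio, the vocabulary size, and the exact definition of the successor task (here, the next-larger element of a sequence with distinct values) are all assumptions. Only the training-length cap of $20$ comes from the text above.

```python
import random

MAX_TRAIN_LEN = 20  # training lengths are capped at 20, as stated in the abstract


def make_sort_example(rng, length, vocab_size):
    """Main task: map a sequence (repetitions allowed) to its sorted order."""
    seq = [rng.randrange(vocab_size) for _ in range(length)]
    return {"task": "sort", "input": seq, "target": sorted(seq)}


def make_successor_example(rng, length, vocab_size):
    """Auxiliary task (assumed form): given a sequence and a query element,
    output the next-larger element of the sequence.

    Values are sampled without repetition here so the successor is unique;
    this is a simplification made for the sketch.
    """
    seq = rng.sample(range(vocab_size), length)
    ordered = sorted(seq)
    idx = rng.randrange(length - 1)  # never query the maximum
    return {
        "task": "succ",
        "input": seq,
        "query": ordered[idx],
        "target": ordered[idx + 1],
    }


def make_batch(rng, batch_size, vocab_size=100, aux_fraction=0.5):
    """Mix main-task and auxiliary-task examples in one batch.

    aux_fraction is a hypothetical mixing ratio, not a value from the paper.
    """
    batch = []
    for _ in range(batch_size):
        length = rng.randint(2, MAX_TRAIN_LEN)  # inclusive on both ends
        if rng.random() < aux_fraction:
            batch.append(make_successor_example(rng, length, vocab_size))
        else:
            batch.append(make_sort_example(rng, length, vocab_size))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    for example in make_batch(rng, batch_size=4):
        print(example)
```

In a setup like this, both tasks would be fed to the same model, for instance with a task tag in the input, so that the auxiliary examples shape the shared representation. Evaluation on length $100$ would then use only the main sorting task.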
