Assessing the Impact of Sequence Length Learning on Classification Tasks for Transformer Encoder Models
Jean-Thomas Baillargeon, Luc Lamontagne
TL;DR
This paper addresses the problem that differences in sequence length between classes can serve as a spurious predictive feature in transformer-based text classifiers. It introduces an empirical protocol to inject and detect sequence length learning across four datasets and several transformer architectures. Two data-centric mitigation strategies are evaluated: removing observations whose length falls outside the region where the class length distributions overlap, and augmenting the data with a masked language model to increase that overlap. The findings show that length-based shortcuts can dominate predictions when class length distributions are imbalanced, but that mitigation can reduce reliance on length and restore content-based decision making. This has practical implications for private-domain NLP, where length bias may be present.
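The first mitigation strategy, keeping only observations whose length lies in the region shared by all classes, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `overlap_bounds` and `keep_overlap`, the whitespace tokenization, and the definition of the overlap region as the range between the largest per-class minimum length and the smallest per-class maximum length are all assumptions made for the example.

```python
from collections import defaultdict


def overlap_bounds(lengths, labels):
    """Return (low, high), the length range covered by every class.

    Assumption: the overlap region is bounded by the largest per-class
    minimum and the smallest per-class maximum. If low > high, the
    classes do not overlap at all.
    """
    by_class = defaultdict(list)
    for length, label in zip(lengths, labels):
        by_class[label].append(length)
    low = max(min(values) for values in by_class.values())
    high = min(max(values) for values in by_class.values())
    return low, high


def keep_overlap(texts, labels, tokenize=str.split):
    """Keep only (text, label) pairs whose token count is in the overlap.

    `tokenize` is a hypothetical stand-in (whitespace split); a real
    pipeline would use the model's own tokenizer.
    """
    lengths = [len(tokenize(text)) for text in texts]
    low, high = overlap_bounds(lengths, labels)
    return [
        (text, label)
        for text, label, length in zip(texts, labels, lengths)
        if low <= length <= high
    ]
```

With a percentile-based definition of the bounds instead of strict minima and maxima, the same filter would be less sensitive to a few outlying lengths; which definition is appropriate depends on the corpus.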
Abstract
Classification algorithms using Transformer architectures can be affected by the sequence length learning problem whenever observations from different classes have different length distributions. This problem causes models to use sequence length as a predictive feature instead of relying on important textual information. Although most public datasets are not affected by this problem, privately owned corpora in fields such as medicine and insurance may carry this data bias. The exploitation of this sequence length feature poses challenges throughout the value chain, as these machine learning models can be used in critical applications. In this paper, we empirically expose this problem and present approaches to minimize its impact.
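One quick way to see whether a corpus is exposed to this problem is to check how well length alone separates the classes. The sketch below is a hypothetical diagnostic and not the protocol used in the paper: it assumes binary labels encoded as 0 and 1, and the function name `length_only_accuracy` is introduced only for illustration. It searches for the single length threshold that best predicts the label and reports the resulting in-sample accuracy.

```python
def length_only_accuracy(lengths, labels):
    """Return (threshold, accuracy) of the best one-threshold rule on length.

    Assumes binary labels in {0, 1}. The rule is "predict 1 when
    length > threshold" or its mirror image, whichever is more accurate.
    A threshold of None means no rule beats the majority-class baseline.
    """
    total = len(labels)
    positives = sum(labels)
    best_accuracy = max(positives, total - positives) / total
    best_threshold = None
    for threshold in sorted(set(lengths)):
        correct = sum(
            (length > threshold) == bool(label)
            for length, label in zip(lengths, labels)
        )
        # The mirrored rule gets exactly the complementary items right.
        accuracy = max(correct, total - correct) / total
        if accuracy > best_accuracy:
            best_accuracy, best_threshold = accuracy, threshold
    return best_threshold, best_accuracy
```

An accuracy close to the majority-class baseline suggests that length carries little class information, while a value well above it indicates that a classifier could exploit length as a shortcut.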
