Understanding attention-based encoder-decoder networks: a case study with chess scoresheet recognition
Sergio Y. Hayashi, Nina S. T. Hirata
TL;DR
This study probes how encoder-decoder networks with attention learn end-to-end reading of handwritten chess scoresheets, shifting the focus from final accuracy to learning dynamics. By decomposing the task into three subtasks (alignment, predictability from language context, and recognition), the authors experimentally analyze how factors such as teacher forcing, training set size, image resolution, and sequence length affect each subtask and the interactions among them. They report clear relationships: alignment is essential for recognition, predictability can dominate when data are scarce, and image quality critically affects recognition, with attention density playing a key role. An incremental training approach using synthetic data substantially improves performance on real images (up to 79.27% accuracy on length-16 sequences) and highlights the potential of curriculum-like strategies for generalizing to new data. The work offers practical guidance for training attention-based image-to-sequence systems in low-data regimes.
Abstract
Deep neural networks are widely used for complex prediction tasks. There is plenty of empirical evidence of their successful end-to-end training for a wide variety of tasks. Success is often measured solely by the final performance of the trained network, and explanations of when, why and how they work receive less emphasis. In this paper we study encoder-decoder recurrent neural networks with attention mechanisms for the task of reading handwritten chess scoresheets. Rather than prediction performance, our concern is to better understand how learning occurs in this type of network. We characterize the task in terms of three subtasks, namely input-output alignment, sequential pattern recognition, and handwriting recognition, and experimentally investigate which factors affect their learning. We identify competition, collaboration and dependence relations between the subtasks, and argue that such knowledge might help one better balance these factors to train a network properly.
