Introduction to Sequence Modeling with Transformers
Joni-Kristian Kämäräinen
TL;DR
This work investigates how Transformer models can be made to work for sequence-to-sequence tasks by isolating and studying components beyond attention. Beginning with a plain Transformer baseline, it demonstrates why simple setups tend to converge to the input mean and then progressively adds token embedding, tokenization, masking, positional encoding, and padding to achieve correct sequence mappings on binary data. It highlights the necessity of aligning training and inference (via future masking) and stabilizing training (via pre-attention normalization), illustrating practical design choices with targeted experiments. The result is a didactic, end-to-end exploration that clarifies the role of each component and provides a hands-on Jupyter notebook to help practitioners apply these elements in real-world Transformer-based sequence modeling.
Abstract
Understanding the transformer architecture and its workings is essential for machine learning (ML) engineers. However, truly understanding it can be demanding, even with a solid background in machine learning or deep learning. The main workhorse is attention, which gives rise to the transformer encoder-decoder structure. However, setting attention aside leaves several programming components that are easy to implement but whose role in the whole is unclear. These components are 'tokenization', 'embedding' ('un-embedding'), 'masking', 'positional encoding', and 'padding'. The focus of this work is on understanding them. To keep things simple, the understanding is built incrementally by adding components one by one and, after each step, investigating what the current model can and cannot do. Simple sequences of zeros (0) and ones (1) are used to study the workings of each step.
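To make the listed components concrete, the following is a minimal pure-Python sketch of tokenization, padding, masking, and positional encoding applied to binary sequences. It is an illustration under stated assumptions, not the implementation used in this work: the vocabulary `VOCAB`, the special tokens `<sos>`, `<eos>`, and `<pad>`, all function names, the "True means masked" convention, and the choice of sinusoidal positional encoding are hypothetical. Embedding and un-embedding are learned lookup tables and are left out.

```python
import math

# Hypothetical vocabulary: the two data symbols plus assumed special tokens.
VOCAB = {"0": 0, "1": 1, "<sos>": 2, "<eos>": 3, "<pad>": 4}

def tokenize(bits):
    """Map a string such as '0110' to token ids, wrapped in start/end tokens."""
    return [VOCAB["<sos>"]] + [VOCAB[b] for b in bits] + [VOCAB["<eos>"]]

def pad(batch, pad_id=VOCAB["<pad>"]):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

def padding_mask(batch, pad_id=VOCAB["<pad>"]):
    """True where a position holds padding and should be ignored by attention."""
    return [[tok == pad_id for tok in seq] for seq in batch]

def future_mask(n):
    """True where position i would look at a later position j > i."""
    return [[j > i for j in range(n)] for i in range(n)]

def positional_encoding(n, d_model):
    """Sinusoidal encoding (assumed here): sin on even dims, cos on odd dims."""
    pe = [[0.0] * d_model for _ in range(n)]
    for pos in range(n):
        for k in range(0, d_model, 2):
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe

# Two binary sequences of different lengths become one rectangular batch.
batch = pad([tokenize("01"), tokenize("0110")])
pad_mask = padding_mask(batch)
causal = future_mask(len(batch[0]))
pe = positional_encoding(len(batch[0]), d_model=4)
```

In this sketch the padding mask hides filler positions that exist only to make the batch rectangular, while the future mask hides positions that would not yet be available during step-by-step inference, which is what aligns training with inference.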
