Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow
Cyrile Delestre, Yoann Sola
TL;DR
The study addresses processing Banking Transaction Flows (BTF), a PSD2-aligned multimodal sequential dataset (date, amount, wording), by applying self-attention through two pre-trained encoders (RNN and Transformer) built on a specialized tokenization. It introduces three self-supervised pre-training tasks (Masked Wording Model, Masked Amount Model, Next Sequence Prediction) and demonstrates that fine-tuning these pre-trained models yields significant performance gains on transaction categorization and credit risk scoring, outperforming strong baselines. The work highlights the importance of preserving multimodal information, analyzes hardware efficiency and memory trade-offs, and confirms the models’ generalization across diverse downstream tasks, with implications for open banking and broader banking analytics. It also outlines future directions such as knowledge distillation and quantization to reduce computational cost while maintaining performance.
Abstract
Banking Transaction Flow (BTF) is a type of sequential data found in a number of banking activities such as marketing, credit risk, and banking fraud. It is multimodal, composed of three modalities: a date, a numerical value, and a wording. In this work, we propose an application of the self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We propose a specific tokenization in order to process BTFs. The performance of these two models was evaluated on two downstream banking tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models outperforms state-of-the-art approaches on both tasks.
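
To make the three modalities concrete, the following is a minimal sketch of how a BTF and a masked-wording training example could be represented. It is an illustration under stated assumptions, not the authors' implementation: the `Transaction` class, the `mask_wordings` helper, the `[MASK]` symbol, and the sample transactions are all hypothetical, and masking a whole wording at once is a simplification that stands in for the specific tokenization, which is not detailed here.

```python
from dataclasses import dataclass
from datetime import date
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden wording


@dataclass(frozen=True)
class Transaction:
    """One element of a BTF, holding the three modalities."""
    day: date       # date modality
    amount: float   # numerical value modality
    wording: str    # wording (free-text label) modality


def mask_wordings(flow, mask_prob, rng):
    """Hide wordings at random and return (masked_flow, targets).

    `targets` maps the position of each masked transaction to its
    original wording, which a model would be trained to recover.
    Dates and amounts are left untouched (assumption: only the wording
    modality is masked in this sketch).
    """
    masked, targets = [], {}
    for i, tx in enumerate(flow):
        if rng.random() < mask_prob:
            targets[i] = tx.wording
            masked.append(Transaction(tx.day, tx.amount, MASK))
        else:
            masked.append(tx)
    return masked, targets


# Made-up transactions, ordered by date, for illustration only.
flow = [
    Transaction(date(2023, 1, 2), -42.50, "CARD PAYMENT GROCERY"),
    Transaction(date(2023, 1, 5), 1800.00, "SALARY TRANSFER"),
    Transaction(date(2023, 1, 9), -650.00, "RENT DIRECT DEBIT"),
]

masked_flow, targets = mask_wordings(flow, mask_prob=0.5, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the masking reproducible, which is convenient when comparing pre-training runs.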
