Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow

Cyrile Delestre, Yoann Sola

TL;DR

The study addresses processing Banking Transaction Flows (BTF), a PSD2-aligned multimodal sequential dataset (date, amount, wording), by applying self-attention through two pre-trained encoders (RNN and Transformer) built on a specialized tokenization. It introduces three self-supervised pre-training tasks (Masked Wording Model, Masked Amount Model, Next Sequence Prediction) and demonstrates that fine-tuning these pre-trained models yields significant performance gains on transaction categorization and credit risk scoring, outperforming strong baselines. The work highlights the importance of preserving multimodal information, analyzes hardware efficiency and memory trade-offs, and confirms the models’ generalization across diverse downstream tasks, with implications for open banking and broader banking analytics. It also outlines future directions such as knowledge distillation and quantization to reduce computational cost while maintaining performance.

Abstract

A Banking Transaction Flow (BTF) is a type of sequential data found in a number of banking activities such as marketing, credit risk, and banking fraud. It is multimodal, composed of three modalities: a date, a numerical value, and a wording. In this work, we propose an application of the self-attention mechanism to the processing of BTFs. We trained two general models on a large amount of BTFs in a self-supervised way: one RNN-based model and one Transformer-based model. We proposed a specific tokenization in order to process BTFs. The performance of these two models was evaluated on two downstream banking tasks: a transaction categorization task and a credit risk task. The results show that fine-tuning these two pre-trained models outperforms state-of-the-art approaches on both tasks.
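To make the three modalities concrete, the sketch below shows one way a BTF could be held in memory and turned into discrete features per modality. It is only an illustration: the `Transaction` class, the `amount_bucket` log-scale binning, and the `tokenize` helper are hypothetical names and choices introduced here, and they do not reproduce the tokenization proposed in the paper.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass(frozen=True)
class Transaction:
    """One BTF element: a date, a signed amount, and a free-text wording."""
    day: date
    amount: float
    wording: str


def amount_bucket(amount: float, n_buckets: int = 16) -> int:
    """Hypothetical log-scale binning of the absolute amount.

    Assumption for illustration only; this is not the paper's amount tokenizer.
    """
    magnitude = math.log10(1.0 + abs(amount))
    return min(int(magnitude * 4), n_buckets - 1)


def tokenize(tx: Transaction) -> dict:
    """Map each modality to discrete features, keeping the modalities separate."""
    return {
        "weekday": tx.day.weekday(),          # date modality (0 = Monday)
        "sign": 0 if tx.amount < 0 else 1,    # debit vs. credit
        "amount_bucket": amount_bucket(tx.amount),
        "wording": tx.wording.lower().split(),
    }


# A toy flow of two transactions (invented values).
flow = [
    Transaction(date(2023, 1, 2), -42.50, "CARD PAYMENT GROCERY"),
    Transaction(date(2023, 1, 5), 1800.00, "SALARY TRANSFER"),
]
tokens = [tokenize(tx) for tx in flow]
```

Keeping one feature group per modality, rather than flattening everything into a single string, mirrors the point made in the TL;DR that multimodal information should be preserved.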

Paper Structure

This paper contains 18 sections, 10 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Global models diagram and their pre-training heads.
  • Figure 2: Diagram of the three quantizers composing the amount tokenizer. The transaction amount is represented as a function of token ids or steps.
  • Figure 3: Structure type with an encoder. The green (resp. blue) boxes represent the first (resp. second) sequence, the ovals the attention process and the red boxes the output of the models.
  • Figure 4: Contribution of each modality.
  • Figure A.1: Distribution of 10k sequences based on one month of bank transactions.
  • ...and 6 more figures