Specialized text classification: an approach to classifying Open Banking transactions

Duc Tuyen TA, Wajdi Ben Saad, Ji Young Oh

TL;DR

This work addresses Open Banking transaction classification for French data under PSD2 by proposing a language-aware pipeline spanning data collection, preprocessing, labeling, and modeling. It compares a TF-IDF/Linear SVM approach with a Word2Vec/Random Forest setup, highlighting the value of language-specific preprocessing and a single, scalable multi-class classifier. On a large, imbalanced French dataset with 84 categories, the Word2Vec+RF setup achieves about 95% weighted precision/recall/F1, while TF-IDF with SVM also performs strongly, though some underrepresented categories pose challenges. The study demonstrates that targeted preprocessing and a streamlined modeling approach can deliver practical, efficient transaction classification useful for fraud prevention and customer insight in Open Banking contexts.

Abstract

The EU's PSD2 regulation, which established the Open Banking framework, has opened new opportunities for banks and fintechs to explore and enrich bank transaction descriptions, with the aim of better understanding customer behavior and using that understanding to prevent fraud, reduce risk, and offer more competitive and tailored services. Although natural language processing models and techniques have made remarkable progress across many applications and domains in recent years, custom applications built on domain-specific text corpora remain unaddressed, especially in the banking sector. In this paper, we introduce a language-based Open Banking transaction classification system focused on the French market and French-language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to the challenges of training a language model on a specialized text corpus (banking data in the French context). By incorporating language-specific techniques and domain knowledge, the proposed system demonstrates enhanced performance and efficiency compared to generic approaches.
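
To make "language-specific preprocessing" concrete, the following is a minimal sketch of how a French transaction description might be normalized before feature extraction. It is an illustration under stated assumptions, not the authors' pipeline: the function name `normalize_description`, the stopword list, and the sample strings are all hypothetical, and the specific steps (lowercasing, accent stripping, dropping digits and punctuation, stopword removal) are common choices rather than steps confirmed by this section.

```python
import re
import unicodedata
from typing import List

# Hypothetical stopword list for illustration only; not taken from the paper.
FRENCH_STOPWORDS = {"de", "du", "la", "le", "les", "des", "au", "aux", "et"}


def normalize_description(text: str) -> List[str]:
    """Return lowercase, accent-free word tokens from a transaction description."""
    # Lowercase, then decompose accented characters (e.g. "é" becomes "e" + mark).
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks so only the base letters remain.
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of non-letters (digits, punctuation, spaces) with one space.
    letters_only = re.sub(r"[^a-z]+", " ", no_accents)
    # Split on whitespace and remove the illustrative stopwords.
    return [tok for tok in letters_only.split() if tok not in FRENCH_STOPWORDS]


if __name__ == "__main__":
    # Made-up example strings, not data from the paper.
    print(normalize_description("PRLV SEPA Électricité de France 12/03"))
    print(normalize_description("CARTE 0412 BOULANGERIE DU MARCHÉ"))
```

Tokens produced this way could then feed either of the feature extractors named in the TL;DR (TF-IDF or Word2Vec); which normalization steps actually help is an empirical question for the corpus at hand.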

Paper Structure

This paper contains 21 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Illustration of the text preprocessing and human name detection process for banking transaction descriptions.