Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

Silvia García-Méndez, Milagros Fernández-Gavilanes, Jonathan Juncal-Martínez, Francisco J. González-Castaño, Oscar Barba Seara

TL;DR

The paper tackles the classification of short, mnemonic banking transaction descriptions for personal finance management (PFM), a domain where methods designed for longer texts struggle. It presents a three-stage system that combines domain-specific lexica, word and character n-grams, and meta-features (amount and date) with a one-versus-one SVM classifier, augmented by a Jaccard-distance similarity detector that reduces the training set size. The model is trained and evaluated on a real labelled corpus of 30,844 entries spanning 15 categories, using macro-averaged precision, recall, and F1 as metrics. Results show that lexicon features significantly boost precision and that the approach can outperform strong CNN-based baselines while substantially reducing training time. A real-world use case with the CoinScrap application illustrates its practical applicability to PFM tools. Overall, the work provides a compact, efficient framework for short-text banking classification that balances accuracy, speed, and privacy, making it suitable for deployment in financial apps; the labelled dataset is available to other researchers on request.
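As an illustration of the evaluation metrics named above, the following minimal sketch computes macro-averaged precision, recall, and F1 from two label lists. The function name `macro_scores` and the category names in the usage note are hypothetical. Averaging the per-class F1 values is only one common convention and is assumed here; the paper may define macro F1 differently.

```python
from __future__ import annotations

from collections import Counter


def macro_scores(true_labels: list[str], predicted_labels: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1), each averaged with equal weight per class."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    if not labels:
        raise ValueError("at least one label is required")

    # Per-class counts of true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]  # times the class was predicted
        actual = tp[label] + fn[label]     # times the class really occurred
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

For example, `macro_scores(["food", "food", "transport", "leisure"], ["food", "transport", "transport", "leisure"])` weights the three made-up categories equally, regardless of how many transactions each contains.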

Abstract

Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector based on the Jaccard distance to reduce training set size. Experimental results with a two-stage classifier combining this detector with an SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available on Google Play and the App Store.
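The idea of using the Jaccard distance to shrink a training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the greedy filtering strategy, the function names, and the default threshold of 0.25 are all assumptions made for the example.

```python
from __future__ import annotations


def tokens(description: str) -> frozenset[str]:
    """Lower-case a description and split it into a set of whitespace tokens."""
    return frozenset(description.lower().split())


def jaccard_distance(a: frozenset[str], b: frozenset[str]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def drop_near_duplicates(descriptions: list[str], threshold: float = 0.25) -> list[str]:
    """Greedily keep a description only if its distance to every
    previously kept description is greater than `threshold`."""
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for text in descriptions:
        current = tokens(text)
        if all(jaccard_distance(current, seen) > threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(current)
    return kept
```

With made-up descriptions such as `"card payment grocery store 0412"` and `"card payment grocery store 0519"`, the two token sets share 4 of 6 distinct tokens, so their distance is 1 - 4/6 ≈ 0.33. The second entry is kept at the default threshold and dropped if the threshold is raised to 0.4. The greedy pass compares each entry with all kept ones, so its cost grows quadratically in the worst case.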

Paper Structure

This paper contains 27 sections, 5 equations, 4 figures, and 16 tables.

Figures (4)

  • Figure 1: System stages.
  • Figure 2: Flow diagram of the system with Jaccard similarity detector.
  • Figure 3: UML diagram of the lexicon generation procedure.
  • Figure 4: CoinScrap app.