Identifying tax evasion in Mexico with tools from network science and machine learning

Martin Zumaya, Rita Guerrero, Eduardo Islas, Omar Pineda, Carlos Gershenson, Gerardo Iñiguez, Carlos Pineda

TL;DR

This work tackles Mexican tax evasion detection by leveraging the CFDI transaction network and combining network-science with two ML approaches: a DRNN/LSTM and a Random Forest. The study shows high classification performance (>0.9 accuracy) in identifying EFOS-like behavior and reveals network-structure signatures, such as strongly connected components and proximity-based collusion indicators, that differentiate evaders from the broader taxpayer base. By merging ML outputs with complex-network metrics, the authors produce prioritized suspect lists and yearly VAT-evasion estimates (on the order of tens of billions of MXN per year) to guide SAT investigations. The approach demonstrates the potential of large-scale transactional data to support law-enforcement and policy decisions, while acknowledging limitations and the need for human validation and ethical considerations.

Abstract

Mexico has kept electronic records of all taxable transactions since 2014. Anonymized data collected by the Mexican federal government comprises more than 80 million contributors (individuals and companies) and almost 7 billion monthly aggregations of invoices among contributors between January 2015 and December 2018. This data includes a list of almost ten thousand contributors already identified as tax evaders because they fabricate invoices for non-existent products or services so that recipients can evade taxes. Harnessing this extensive dataset, we build monthly and yearly temporal networks where nodes are contributors and directed links are invoices produced in a given time slice. Exploring the properties of the network neighborhoods around tax evaders, we show that their interaction patterns differ from those of the majority of contributors. In particular, invoicing loops between tax evaders and their clients are over-represented. With this insight, we use two machine-learning methods to classify other contributors as suspects of tax evasion: deep neural networks and random forests. We train each method with a portion of the tax evader list and test it with the rest, obtaining more than 0.9 accuracy with both methods. By using the complete dataset of contributors, each method classifies more than 100 thousand suspects of tax evasion, with more than 40 thousand suspects classified by both methods. We further reduce the number of suspects by focusing on those with a short network distance from known tax evaders. We thus obtain a list of highly suspicious contributors sorted by the amount of evaded tax, valuable information for the authorities to further investigate illegal tax activity in Mexico. With our methods, we estimate previously undetected tax evasion on the order of \$10 billion USD per year by about 10 thousand contributors.
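
The network steps described in the abstract (directed invoice links, invoicing loops, and filtering suspects by network distance from known evaders) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy invoice rows, the restriction to two-node loops, and the choice to measure distance on the undirected version of the network are not taken from the paper.

```python
from collections import defaultdict, deque

def build_network(invoices):
    """Directed graph: issuer -> set of recipients, from (issuer, recipient, amount) rows."""
    out_links = defaultdict(set)
    for issuer, recipient, _amount in invoices:
        out_links[issuer].add(recipient)
    return out_links

def reciprocal_pairs(out_links):
    """Shortest possible invoicing loops: pairs of contributors that invoice each other."""
    pairs = set()
    for a, recipients in out_links.items():
        for b in recipients:
            if a != b and a in out_links.get(b, ()):
                pairs.add(frozenset((a, b)))
    return pairs

def within_distance(out_links, sources, max_hops):
    """Hop distance to the nearest source, ignoring link direction (an assumption)."""
    neighbors = defaultdict(set)
    for a, recipients in out_links.items():
        for b in recipients:
            neighbors[a].add(b)
            neighbors[b].add(a)
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Hypothetical toy data: (issuer, recipient, amount in pesos).
invoices = [
    ("E1", "A", 50000.0),
    ("A", "E1", 48000.0),
    ("A", "B", 1000.0),
    ("B", "C", 200.0),
    ("D", "D2", 10.0),
]
known_evaders = {"E1"}
ml_suspects = {"A", "C", "D"}  # stand-in for the output of a classifier

network = build_network(invoices)
loops = reciprocal_pairs(network)
dist = within_distance(network, known_evaders, max_hops=2)
shortlist = {s for s in ml_suspects if s in dist and s not in known_evaders}
```

In this toy example only the suspect `A` survives the distance filter: `C` is three hops from the known evader and `D` is not connected to it at all.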

Paper Structure

This paper contains 25 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: A chunk of the neural network, $A$, observes an input $x_t$ and computes a value $h_t$. The loop allows information to pass from one step of the network to the next. If we unroll the loop, a recurrent neural network can be viewed as multiple copies of the same network, each passing a message to its successor.
  • Figure 2: Histograms of the probabilities assigned by the neural network to different sets of RFCAs. (left) We observe that the network correctly assigns most of them a high probability of being EFOS. (right) We observe a bimodal distribution in which there is a considerable percentage of RFCAs that are assigned a high probability of being EFOS.
  • Figure 3: Directed links in the network correspond to CFDIs issued between RFCAs. These digital receipts (CFDIs) can represent income or expenses.
  • Figure 4: Examples of the strongly connected components observed in interaction networks at the yearly timescale. The left panel corresponds to 2015 and the right panel to 2016. Red nodes correspond to confirmed EFOS, yellow nodes to suspected EFOS, and blue nodes to unclassified RFCAs.
  • Figure 5: Time evolution of the logarithm of the subtotal of the transactions associated with invoices issued by EFOS (left panel) and alleged EFOS (right panel). The vertical dashed lines indicate December of each year. Solid lines show the median and dashed lines the interquartile range of the distribution. Transactions between EFOS, either definitive or alleged, involve amounts ranging from tens of thousands to millions of pesos. We define this range of amounts as the EFOS activity regime, which we use to filter the links in the interaction networks.
  • ...and 7 more figures
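
The unrolling described in the Figure 1 caption can be made concrete with a short Python sketch. It uses a plain recurrent cell with a single scalar state and hypothetical weights (`w_x`, `w_h`, `b`), chosen only for illustration. It is not the authors' model, which the TL;DR describes as LSTM-based and which would add gating on top of this recurrence.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One copy of the cell A: combine the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, h0=0.0):
    """Looped form: the same cell is applied at every time step."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

xs = [1.0, 0.0, -1.0]
looped = run_rnn(xs)

# Unrolled form: three explicit copies of the same cell,
# each passing its state h to its successor.
h1 = rnn_step(xs[0], 0.0)
h2 = rnn_step(xs[1], h1)
h3 = rnn_step(xs[2], h2)
```

Both forms perform the same computation, so `looped` and `[h1, h2, h3]` hold identical values. Unrolling only changes how the recurrence is drawn, not what it calculates.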