Towards fairer public transit: Real-time tensor-based multimodal fare evasion and fraud detection

Peter Wauyo, Dalia Bwiza, Alain Murara, Edwin Mugume, Eric Umuhoza

TL;DR

The paper tackles fare evasion detection in public transit by fusing CCTV video and audio signals through a Tensor Fusion Network that explicitly models unimodal and cross-modal interactions. It leverages ViViT for video feature extraction and AST for audio, with modality-specific embeddings feeding into a fusion layer that yields a 2,145-dimensional representation for fraud detection. On a Rwanda-focused dataset, the approach achieves 89.5% accuracy, 87.2% precision, and 84.0% recall, outperforming early fusion and unimodal baselines, and ablations show notable gains from embedding preprocessing and cross-modal interactions. The work demonstrates real-time detection capabilities with edge-friendly requirements, suggesting practical impact for reducing revenue loss and improving safety in emerging transit markets, while outlining future work on real-time processing, privacy, and contextual data integration.

Abstract

This research introduces a multimodal system designed to detect fraud and fare evasion in public transportation by analyzing closed-circuit television (CCTV) and audio data. The proposed solution uses the Vision Transformer for Video (ViViT) model for video feature extraction and the Audio Spectrogram Transformer (AST) for audio analysis. The system implements a Tensor Fusion Network (TFN) architecture that explicitly models unimodal and bimodal interactions through a 2-fold Cartesian product. This advanced fusion technique captures complex cross-modal dynamics between visual behaviors (e.g., tailgating, unauthorized access) and audio cues (e.g., fare transaction sounds). The system was trained and tested on a custom dataset, achieving an accuracy of 89.5%, precision of 87.2%, and recall of 84.0% in detecting fraudulent activities, significantly outperforming early fusion baselines and exceeding the 75% recall rates typically reported in state-of-the-art transportation fraud detection systems. Our ablation studies demonstrate that the tensor fusion approach provides a 7.0% improvement in the F1 score and an 8.8% boost in recall compared to traditional concatenation methods. The solution supports real-time detection, enabling public transport operators to reduce revenue loss, improve passenger safety, and ensure operational compliance.
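The 2-fold Cartesian product used by tensor fusion can be illustrated with a minimal sketch. A constant 1 is appended to each modality embedding and the outer product of the two extended vectors is flattened, so the result holds the bimodal products, both unimodal embeddings, and a bias term. The embedding sizes 32 (video) and 64 (audio) are assumptions chosen only because (32 + 1) × (64 + 1) = 2,145 matches the stated fusion dimension; the source does not give the individual sizes, and the function name `tensor_fusion` is hypothetical.

```python
import random


def tensor_fusion(z_video, z_audio):
    """Flattened outer product of the two embeddings, each extended with a 1."""
    v = list(z_video) + [1.0]  # the appended 1 preserves the audio-only terms
    a = list(z_audio) + [1.0]  # the appended 1 preserves the video-only terms
    # Row-major flattening of the (len(v) x len(a)) outer-product matrix.
    return [vi * aj for vi in v for aj in a]


# Assumed embedding sizes (not stated in the source): 32 for video, 64 for audio.
random.seed(0)
z_video = [random.gauss(0.0, 1.0) for _ in range(32)]
z_audio = [random.gauss(0.0, 1.0) for _ in range(64)]

fused = tensor_fusion(z_video, z_audio)
print(len(fused))  # 33 * 65 = 2145
```

Early fusion by concatenation would instead produce a 96-dimensional vector with no explicit product terms, which is the difference the ablation against concatenation measures.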

Paper Structure

This paper contains 22 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Current system for detecting fare evasion in Rwanda: (a) public buses are equipped with smart card readers that function as payment validators; (b) CCTV cameras installed near the validators capture video footage of passengers as they board and alight; and (c) this footage is streamed in real time to a control room, where personnel monitor multiple feeds simultaneously in an attempt to identify passengers who fail to tap their cards.
  • Figure 2: TFN architecture for multimodal fraud detection in public transportation systems.
  • Figure 3: Multimodal fusion architecture for video and audio fraud detection.
  • Figure 4: Confusion matrix for the Tensor Fusion model.