Analysing Public Transport User Sentiment on Low Resource Multilingual Data

Rozina L. Myoya, Vukosi Marivate, Idris Abdulmumin

TL;DR

This study addresses the gap in understanding public transport user experience in Sub-Saharan Africa by applying multilingual sentiment analysis to X (formerly Twitter) data from Kenya, Tanzania, and South Africa. It combines language-aware preprocessing with African-language pre-trained language models (AfriBERTa, AfroXLMR, AfroLM, PuoBERTa), uses language identification to handle code-switching, and applies Word2Vec embeddings with K-Means clustering to extract themes. In Kenya and South Africa, the results show predominantly negative sentiment related to pricing, safety, and infrastructure, whereas Tanzania exhibits a positive bias largely due to advertising-focused content; these findings highlight context-specific drivers of commuter satisfaction. The work demonstrates the feasibility and utility of NLP for under-resourced languages in transport analytics and suggests future directions such as broader data sources, robust validation, and aspect-based opinion mining to produce actionable insights for improving urban mobility and Quality of Service (QoS) in Sub-Saharan Africa.

Abstract

Public transport systems in many Sub-Saharan countries often receive less attention than other sectors, underscoring the need for innovative solutions to improve the Quality of Service (QoS) and overall user experience. This study explored commuter opinion mining to understand sentiments toward existing public transport systems in Kenya, Tanzania, and South Africa. We used a qualitative research design, analysing data from X (formerly Twitter) to assess sentiments across rail, mini-bus taxis, and buses. By leveraging multilingual opinion mining techniques, we addressed the linguistic diversity and code-switching present in our dataset, thus demonstrating the application of Natural Language Processing (NLP) in extracting insights from under-resourced languages. We employed pre-trained language models (PLMs) such as AfriBERTa, AfroXLMR, AfroLM, and PuoBERTa to conduct the sentiment analysis. The results revealed predominantly negative sentiments in South Africa and Kenya, while the Tanzanian dataset showed mainly positive sentiments due to the advertising nature of the tweets. Furthermore, feature extraction using the Word2Vec model and K-Means clustering illuminated semantic relationships and primary themes found within the different datasets. By prioritising the analysis of user experiences and sentiments, this research paves the way for developing more responsive, user-centered public transport systems in Sub-Saharan countries, contributing to the broader goal of improving urban mobility and sustainability.
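
To make the theme-extraction step more concrete, the sketch below clusters word vectors with K-Means. It is an illustration under stated assumptions, not the authors' pipeline: the two-dimensional vectors in `toy_vectors` are hypothetical stand-ins for Word2Vec embeddings, the vocabulary is invented for the example, and `kmeans` is a plain Lloyd's-algorithm implementation written here for self-containment.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's K-Means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical 2-D "embeddings"; a trained Word2Vec model would supply
# vectors with far more dimensions. Words and values are made up.
toy_vectors = {
    "fare": (0.9, 0.1), "price": (0.8, 0.2), "expensive": (0.85, 0.15),
    "robbery": (0.1, 0.9), "unsafe": (0.2, 0.8), "accident": (0.15, 0.85),
}

words = list(toy_vectors)
labels = kmeans([toy_vectors[w] for w in words], k=2)

# Group words by cluster index to read off candidate themes.
themes = {}
for word, label in zip(words, labels):
    themes.setdefault(label, []).append(word)
print(themes)
```

On this toy input the pricing-related words and the safety-related words end up in separate clusters. With real embeddings, the number of clusters `k` is a modelling choice that would need to be tuned and the resulting groups inspected manually.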

Paper Structure

This paper contains 17 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Focus language distribution within the full dataset taking into consideration code mixing
  • Figure 2: Main features extracted from the Kenyan dataset
  • Figure 3: Main features extracted from the Tanzanian dataset
  • Figure 4: Main features extracted from the South African dataset
  • Figure 5: Sentiment distribution of Kenyan tweets according to the themes derived from Section 3.2
  • ...and 2 more figures