Unmask It! AI-Generated Product Review Detection in Dravidian Languages

Somsubhra De, Advait Vats

TL;DR

This work addresses the detection of AI-generated product reviews in two low-resource Dravidian languages, Tamil and Malayalam, by evaluating methods that range from traditional ML to state-of-the-art transformers (e.g., IndicSBERT, MuRIL, XLM-RoBERTa, Malayalam-BERT). On both datasets, transformer-based approaches substantially outperform traditional ML methods and DL architectures: IndicSBERT performs best for Tamil and Malayalam-BERT for Malayalam, and some transformer runs reach near-perfect precision and recall. Qualitative analyses reveal language-specific patterns in AI-generated versus human-written reviews and highlight the value of human-in-the-loop insights. The findings point to the practical potential of transformer-based detectors for improving trust in e-commerce platforms that serve under-resourced languages. Future work includes LLMs, ensemble strategies, larger and more diverse corpora, and ethical considerations. Overall, the paper advances AI-generated content detection in Dravidian languages and offers guidance for deploying robust detectors in real-world, multilingual marketplaces.

Abstract

The rise of generative AI has led to a surge in AI-generated reviews, which pose a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services, and authentic reviews play a vital role in consumer decision-making. Fabricated content misleads consumers, undermines trust, and facilitates fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages in which this problem remains relatively under-explored. We evaluate a range of approaches, from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa and MalayalamBERT. Our findings highlight the effectiveness of state-of-the-art transformers in accurately identifying AI-generated content and demonstrate their potential to improve fake-review detection in low-resource language settings.

Paper Structure

This paper contains 16 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Sample Tamil and Malayalam texts from the training set, with English translations* for context. (*Note: the English translations may not fully capture the nuances, sentiment, and cultural context of the original Tamil and Malayalam texts, so they may not reflect the true tone or intent.)
  • Figure 2: Confusion Matrix for the best runs on Tamil & Malayalam test sets (using IndicSBERT & Malayalam-BERT respectively)
  • Figure 3: Misclassified Tamil texts from the test set alongside the confusion matrix: XLM-R incorrectly predicted these 6 human-written reviews as AI-generated.
  • Figure 4: Most common words in AI-generated (left) & human-written (right) reviews on the Tamil train set
  • Figure 5: Most common words in AI-generated (left) & human-written (right) reviews on the Tamil test set
  • ...and 4 more figures
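
The precision and recall figures mentioned in the TL;DR, and the confusion matrices in Figures 2 and 3, are related by simple arithmetic. The following is a minimal sketch of that relationship, treating "AI-generated" as the positive class. The function name, the dataclass, and the counts are hypothetical illustrations; the counts are not taken from the paper and do not reproduce any of its results.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    precision: float
    recall: float
    f1: float
    accuracy: float


def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> BinaryMetrics:
    """Derive metrics for the positive class from confusion-matrix counts.

    Assumed convention (not stated in the source): the positive class is
    'AI-generated', so a false positive is a human-written review that was
    predicted as AI-generated.
    """
    # Guard every division so an empty class yields 0.0 instead of an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return BinaryMetrics(precision, recall, f1, accuracy)


# Hypothetical counts for illustration only; NOT results from the paper.
example = metrics_from_confusion(tp=45, fp=5, fn=5, tn=45)
print(example)
```

Under this convention, misclassifications such as those described in the Figure 3 caption (human-written reviews predicted as AI-generated) count as false positives, so they lower precision for the AI-generated class while leaving its recall unchanged.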