Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties
Zixin Tang, Chieh-Yang Huang, Tsung-Che Li, Ho Yin Sam Ng, Hen-Hsen Huang, Ting-Hao 'Kenneth' Huang
TL;DR
The paper addresses performance gaps across language varieties in large language models by proposing a cost-effective benchmarking approach built on contextually aligned online reviews. It constructs a paired Taiwan Mandarin (TW) and Mainland Mandarin (CN) dataset of 22,918 hotel-review pairs from Booking.com, matching reviews on hotel, rating class, and text length, and validates data quality with human judgments. Six LLMs are then evaluated on sentiment-rating prediction under structured, plain, and shuffled inputs, revealing consistent underperformance on Taiwan Mandarin, particularly when the input structure is less informative or the texts are shorter. The work also examines confounding factors and machine-translation (MT) based approaches, showing that translation biases do not eliminate the gap. Overall, the study demonstrates a scalable method to benchmark and diagnose language-variety biases in LLMs, with implications for extending to additional varieties and refining labeling and alignment techniques.
Abstract
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, such as reviews for the same hotel with the same rating, written in the same language (e.g., Mandarin Chinese) but in different varieties (e.g., Taiwan Mandarin, Mainland Mandarin). As a proof of concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs on a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
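To make the idea of contextual alignment concrete, the following is a minimal sketch of how reviews in two varieties might be paired on hotel, rating class, and text length. It is an illustration under stated assumptions, not the authors' pipeline: the `Review` fields, the `pair_reviews` function, the greedy matching strategy, and the `max_len_ratio` threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    # All field names are hypothetical; the source does not specify a schema.
    hotel_id: str
    variety: str       # "TW" (Taiwan Mandarin) or "CN" (Mainland Mandarin)
    rating_class: str  # assumed binning of the review score, e.g. "high" / "low"
    text: str


def pair_reviews(reviews, max_len_ratio=1.5):
    """Greedily pair TW and CN reviews that share a hotel and rating class
    and whose character lengths differ by at most max_len_ratio (assumed)."""
    # Group reviews by (hotel, rating class), then by variety.
    groups = defaultdict(lambda: defaultdict(list))
    for r in reviews:
        groups[(r.hotel_id, r.rating_class)][r.variety].append(r)

    pairs = []
    for bucket in groups.values():
        # Sort each side by length so a two-pointer sweep finds close matches.
        tw = sorted(bucket["TW"], key=lambda r: len(r.text))
        cn = sorted(bucket["CN"], key=lambda r: len(r.text))
        i = j = 0
        while i < len(tw) and j < len(cn):
            a, b = len(tw[i].text), len(cn[j].text)
            short, long_ = min(a, b), max(a, b)
            if short > 0 and long_ / short <= max_len_ratio:
                pairs.append((tw[i], cn[j]))  # each review is used at most once
                i += 1
                j += 1
            elif a < b:
                i += 1  # TW review is too short for any remaining CN review
            else:
                j += 1  # CN review is too short for any remaining TW review
    return pairs
```

Each resulting pair shares the same hotel and rating class, so a model's prediction accuracy on the TW side and the CN side can be compared under similar real-world conditions. A real pipeline would also need a step that assigns each review to a variety, which this sketch takes as given.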
