Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties

Zixin Tang, Chieh-Yang Huang, Tsung-Che Li, Ho Yin Sam Ng, Hen-Hsen Huang, Ting-Hao 'Kenneth' Huang

TL;DR

The paper tackles cross-variety performance gaps in large language models by proposing a cost-effective benchmarking approach that leverages contextually aligned online reviews. It constructs a paired Taiwan Mandarin (TW) and Mainland Mandarin (CN) dataset of 22,918 hotel-review pairs from Booking.com by matching on hotel, rating class, and text length, and validates data quality with human judgments. Six LLMs are then evaluated on sentiment-rating prediction under structured, plain, and shuffled inputs. The evaluation reveals consistent underperformance on Taiwan Mandarin, particularly when the input structure is less informative or the texts are shorter. The work also examines confounding factors and machine-translation (MT) based approaches, showing that translation biases do not eliminate the gap. Overall, the study demonstrates a scalable method to benchmark and diagnose language-variety biases in LLMs, with implications for extending to additional varieties and refining labeling and alignment techniques.

Abstract

A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
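
The contextual alignment described above (same hotel, same rating class, similar text length) can be illustrated with a minimal sketch. This is not the authors' pipeline: the field names `hotel_id`, `rating_class`, and `text`, the greedy nearest-length strategy, and the `max_len_ratio` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def pair_reviews(tw_reviews, cn_reviews, max_len_ratio=1.5):
    """Greedily pair each TW review with one CN review of the same hotel
    and rating class whose text length is closest (hypothetical sketch)."""
    # Bucket CN reviews by the alignment key (hotel, rating class).
    buckets = defaultdict(list)
    for review in cn_reviews:
        buckets[(review["hotel_id"], review["rating_class"])].append(review)

    pairs = []
    for tw in tw_reviews:
        candidates = buckets.get((tw["hotel_id"], tw["rating_class"]), [])
        if not candidates:
            continue  # no CN review shares this hotel and rating class
        # Pick the CN candidate with the closest character length.
        best = min(candidates, key=lambda cn: abs(len(cn["text"]) - len(tw["text"])))
        longer = max(len(best["text"]), len(tw["text"]))
        shorter = max(1, min(len(best["text"]), len(tw["text"])))
        # Accept the pair only if the lengths are comparable (assumed threshold).
        if longer / shorter <= max_len_ratio:
            candidates.remove(best)  # each CN review is used at most once
            pairs.append((tw, best))
    return pairs
```

A greedy match keeps the sketch short; a real pipeline could use a different matching strategy or length criterion.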

Paper Structure

This paper contains 34 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Online review platforms can be data sources to build datasets that capture comments in different language varieties from similar real-world scenarios. These contextually aligned datasets can then be used to benchmark LLMs' performance across language varieties.
  • Figure 2: Impact of text length on sentiment classification performance. The top graph shows accuracy, and the bottom graph shows MSE for negative, neutral, positive, and overall sentiments across different text lengths (0-500 characters). While overall performance remains relatively stable, individual sentiment categories show varying levels of accuracy and error, particularly for shorter texts.
  • Figure 3: Comparison of accuracy between Mainland Mandarin and Taiwan Mandarin for short (left) and long (right) texts. Each point represents the performance of one [model, setting] combination. The diagonal line ($x=y$) indicates equal performance. Points above the line suggest better performance in Taiwan Mandarin, while points below suggest better performance in Mainland Mandarin. The accuracy gap does not differ much between short and long texts.
  • Figure 4: Comparison of MSE between Mainland Mandarin and Taiwan Mandarin for short (left) and long (right) texts. Each point represents a model's performance. The diagonal line ($x=y$) indicates equal performance. Points below the line suggest better performance in Taiwan Mandarin, while points above suggest better performance in Mainland Mandarin. Note the larger performance gap for short texts compared to long texts.
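
The comparisons in Figures 3 and 4 rest on two metrics, accuracy and MSE, computed separately for each variety. The sketch below shows how a single point could be read against the diagonal. It assumes sentiment labels are encoded as integers (for example 0 = negative, 1 = neutral, 2 = positive), which is an illustrative choice, and the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def mse(gold, pred):
    """Mean squared error between integer-encoded gold and predicted labels."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def favored_variety(cn_score, tw_score, higher_is_better):
    """Report which variety a point favors relative to the x = y diagonal.

    For accuracy (higher is better), a larger Taiwan Mandarin score favors
    Taiwan Mandarin; for MSE (lower is better), the reading is reversed.
    """
    if cn_score == tw_score:
        return "equal"
    tw_better = (tw_score > cn_score) == higher_is_better
    return "Taiwan Mandarin" if tw_better else "Mainland Mandarin"


# Toy labels for illustration only; these are not results from the paper.
gold = [0, 1, 2, 2]
cn_pred = [0, 1, 2, 2]
tw_pred = [0, 1, 2, 0]

acc_side = favored_variety(accuracy(gold, cn_pred), accuracy(gold, tw_pred), True)
mse_side = favored_variety(mse(gold, cn_pred), mse(gold, tw_pred), False)
```

With these toy labels, both metrics favor Mainland Mandarin: the Taiwan Mandarin accuracy is lower and its MSE is higher.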