
Evaluating Cultural and Social Awareness of LLM Web Agents

Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu

TL;DR

This paper introduces CASA, a benchmark for evaluating the cultural and social awareness of LLM web agents on two real-world tasks: online shopping and social discussion forums. Its evaluation framework measures awareness coverage (AC-R), educational value (Edu-R), helpfulness (Help-R), and violation rate (Vio-R), and covers 17 countries to capture cross-cultural variation. Experiments show that current agents struggle in web-based settings, with awareness coverage below 10% and violation rates above 40%; prompting and fine-tuning yield complementary gains, particularly when combined. Country-level analyses reveal region-specific differences, with US models performing best and non-US regions needing explicit culturally aware prompting. The work emphasizes the need for ongoing benchmarking and richer data collection to improve cross-cultural sensitivity in LLM agents.

Abstract

As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address this gap, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods, prompting and fine-tuning, and find that combining them can offer complementary advantages: fine-tuning on culture-specific datasets significantly enhances the agents' ability to generalize across different regions, while prompting boosts the agents' ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents' cultural and social awareness during the development cycle.

Paper Structure

This paper contains 45 sections, 5 figures, and 26 tables.

Figures (5)

  • Figure 1: A comparison between an evaluation user query from WebArena and a culturally sensitive evaluation user query from our proposed benchmark.
  • Figure 2: Our benchmark CASA uses established cultural and social analysis taxonomies across selected countries to create two scenarios (described in the benchmark section, with more examples in the benchmark examples table). We evaluate LLM agents' responses based on awareness coverage, educational content, helpfulness, and violations (described in the evaluation framework section).
  • Figure 3: Comparison of various prompting techniques across 17 countries for S1-Violate (online shopping).
  • Figure 4: The user interface for norm annotation.
  • Figure 5: Our representative countries on the world map.