VideoNorms: Benchmarking Cultural Awareness of Video Language Models

Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, Meng Fan Wang, Smaranda Muresan

TL;DR

VideoNorms introduces a cross-cultural benchmark for evaluating cultural awareness in video-language models, pairing US and Chinese norms with 15-second video clips and three evaluation tasks. A three-stage workflow (clip selection, teacher annotation prompted with speech act theory, and human verification) produces a dataset of over 1000 (video clip, norm) pairs with verbal and non-verbal evidence and norm-generation outputs. The authors benchmark open-weight VideoLLMs and reveal systematic gaps: norm violations are harder to detect than adherence, Chinese norms are harder than US norms, non-verbal cues are underutilized, and formal contexts exacerbate these difficulties; the models perform at broadly similar levels. The work highlights the need for culturally grounded training and larger-scale cross-cultural evaluation to ensure VideoLLMs generalize across global contexts and social norms.

Abstract

As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework in which a teacher model using theoretically grounded prompting provides candidate annotations and a set of trained human experts validates and corrects them. We benchmark a variety of open-weight VideoLLMs on the new dataset, and the results highlight several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training, a gap our benchmark and framework begin to address.

Paper Structure

This paper contains 38 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: VideoNorms Dataset Construction: left panel shows teacher VideoLLM generations using speech act theory prompting; right panel shows the expert annotator editing process.
  • Figure 2: Examples of Gemini-generated normative behavior annotations and corresponding human refinements for US (left) and Chinese (right) shows, as recorded through the annotation interface.
  • Figure 3: F1 score distributions with 95% CIs by norm category for the US, for Tasks 1, 2 (non-verbal), and 3.
  • Figure 4: Detailed instructions provided on the first page of the user interface. The page is cut into 5 screenshots, from left to right, in this figure.
  • Figure 5: User interface shown to annotators. It comprises a 15-second clip with an option to edit the norm predicted by the teacher model.