The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

TL;DR

The Visual Iconicity Challenge introduces a video-based benchmark that tests vision–language models (VLMs) on sign-language form–meaning grounding. It covers three tasks (phonological form prediction, sign transparency, and graded iconicity rating) over 96 NGT signs with ground-truth annotations and human baselines. Evaluating 13 VLMs in zero-shot and few-shot settings, the study finds that larger models capture some phonological structure and moderate iconicity signals, but struggle to infer lexical meaning and fall far short of human baselines on the transparency task, often biasing toward object-based visual similarity. A key finding is that models with stronger phonological form prediction align more closely with human iconicity judgments, which suggests a shared sensitivity to visually grounded structure while highlighting the embodied grounding that current models lack. The results motivate integrating human-centric signals and embodied learning, such as structured pose information and sign-specific descriptors, to improve visual grounding and iconicity modelling in multimodal systems.

Abstract

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
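The two quantitative comparisons named in the abstract, per-feature accuracy on phonological form prediction and correlation with human iconicity ratings, can be sketched in a few lines. This is a minimal illustration only: the section does not state which correlation coefficient the authors use, so Spearman rank correlation is an assumption here, and the function names, feature labels, and data layout are hypothetical.

```python
def feature_accuracy(predictions, gold):
    """Per-feature accuracy; each item is a dict mapping feature name -> label.

    The feature names (e.g. "handshape", "location") follow the abstract;
    the label values themselves are hypothetical.
    """
    features = gold[0].keys()
    n = len(gold)
    return {
        f: sum(p[f] == g[f] for p, g in zip(predictions, gold)) / n
        for f in features
    }


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(model_ratings, human_ratings):
    """Spearman correlation (assumed metric): Pearson on tie-averaged ranks."""
    return pearson(average_ranks(model_ratings), average_ranks(human_ratings))
```

In this sketch, `feature_accuracy` would be called once per model over all signs, and `spearman` once per model over the per-sign iconicity ratings, giving the two axes on which models are compared with the human baseline.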

Paper Structure

This paper contains 36 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Overview of the Visual Iconicity Challenge: evaluation pipeline for the NGT sign to-cut across the phonological form prediction (top right), transparency (bottom left), and iconicity (bottom right) tasks.
  • Figure 2: Examples of an iconic vs. an arbitrary sign, with their annotated phonological form features. The sign telephone is iconic as its form resembles a telephone’s shape, whereas sugar is arbitrary with no clear visual link to its meaning.
  • Figure 3: Zero-shot accuracy per form feature. Solid black lines indicate the human baseline, and dashed grey lines refer to random. Bars show VLMs. Across models, location and handedness are comparatively easy; handshape and path shape are hardest; path repetition is intermediate. Numbers on bars are mean accuracies.
  • Figure 4: Average iconicity ratings by iconicity type (higher = more iconic).
  • Figure 5: Overall model landscape by zero-shot phonological form prediction accuracy and iconicity scores. Top-right are best; dot size encodes model size.
  • ...and 7 more figures