The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb
TL;DR
The Visual Iconicity Challenge is a video-based benchmark that tests vision-language models on form-meaning grounding in sign language. It covers three tasks: phonological form prediction, sign transparency, and graded iconicity rating, using 96 signs from Sign Language of the Netherlands (NGT) with ground-truth annotations and human baselines. Evaluating 13 VLMs in zero-shot and few-shot settings, the study finds that larger models capture some phonological structure and show moderate correlation with human iconicity ratings, but struggle to infer lexical meaning from form and fall well short of human baselines on transparency, often showing a bias toward object-based visual similarity. A key finding is that models with stronger phonological form prediction align more closely with human iconicity judgments, which suggests shared sensitivity to visually grounded structure while also exposing the lack of embodied grounding in current models. The results motivate integrating human-centric signals and embodied learning, such as structured pose information and sign-specific descriptors, to improve visual grounding and iconicity modelling in multimodal systems.
Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
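To make the graded iconicity comparison concrete, the following is a minimal sketch of how per-sign model ratings might be compared with mean human ratings. The abstract does not say which correlation statistic is used; Spearman's rank correlation is an assumption here, and the function names and the example ratings are hypothetical placeholders, not data or code from the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(model: Sequence[float], human: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(model) != len(human) or len(model) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(model), average_ranks(human))


if __name__ == "__main__":
    # Hypothetical ratings for five signs (placeholders, not study data).
    model_ratings = [2.0, 5.0, 3.0, 6.0, 4.0]
    human_ratings = [1.5, 6.2, 2.8, 5.9, 4.4]
    print(f"Spearman rho = {spearman_rho(model_ratings, human_ratings):.3f}")
```

A rank-based statistic is a natural fit for ordinal rating scales because it depends only on how signs are ordered, not on how a model spaces its scores. Whether the benchmark itself uses this statistic is not stated in the abstract.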
