
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

TL;DR

This work introduces VISLA, a benchmark built on triplets ($P_1$, $P_2$, $N$), where $P_1$ and $P_2$ are semantically equivalent but lexically different and $N$ is semantically different yet lexically similar to $P_1$, to evaluate semantic and lexical understanding in vision-language models (VLMs) and unimodal language models (ULMs). It provides two datasets, generic and spatial, and evaluates 34 VLMs and 20 ULMs on image-to-text and text-to-text retrieval without fine-tuning. The evaluation reveals pervasive difficulty in disentangling lexical from semantic variation, and spatial semantics prove especially sensitive to lexical overlap. The study finds that unimodal text encoders underperform their multimodal counterparts on VISLA, and that VLM text encoders are more sensitive to semantic and lexical variations than ULMs; gains from model size are limited, whereas larger pretraining data helps on the generic dataset but not on spatial semantics. Overall, VISLA offers a diagnostic tool to guide the development of embeddings that capture semantics beyond lexical cues, informing future research on semantic invariance in both multimodal and unimodal language models.

Abstract

Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and assessing LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.

Paper Structure (35 sections, 10 figures, 9 tables)


Figures (10)

  • Figure 1: An example from our VISLA benchmark. $P_1$ and $P_2$ are semantically equivalent but lexically different, while $N$ is semantically different from both $P_1$ and $P_2$ despite its lexical similarity to $P_1$. In our evaluations of state-of-the-art language models (34 VLMs and 20 ULMs) on this example, we (surprisingly) find that none of them can distinguish the semantically equivalent pair ($P_1$, $P_2$) from the semantically different pairs (($P_1$, $N$), ($P_2$, $N$)).
  • Figure 2: Role playing prompt for "Data Generator AI".
  • Figure 3: VISLA task evaluation: Given an image $M$ and a triplet of candidate captions $\{P_1, P_2, N\}$ of $M$, where $P_1$ and $P_2$ are semantically equivalent to each other (referred to as positive captions in the text), we measure the accuracy of ranking the negative caption $N$ below the positive captions for both the image encoder and the text encoder.
  • Figure 4: Rules Prompt used for priming LLM after role-playing instructions.
  • Figure 5: LLM Validation prompt to evaluate the generated caption.
  • ...and 5 more figures
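
The ranking check described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity over precomputed embeddings, counts a text-encoder triplet as correct when the pair ($P_1$, $P_2$) scores higher than both ($P_1$, $N$) and ($P_2$, $N$), and counts an image-encoder triplet as correct when the image $M$ scores higher with both positives than with $N$. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Iterable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_triplet_correct(p1: Vector, p2: Vector, n: Vector) -> bool:
    """Assumed text-to-text criterion: the equivalent pair (P1, P2)
    must be more similar than either pair that involves N."""
    positive = cosine(p1, p2)
    return positive > cosine(p1, n) and positive > cosine(p2, n)


def image_triplet_correct(m: Vector, p1: Vector, p2: Vector, n: Vector) -> bool:
    """Assumed image-to-text criterion: the image M must rank
    the negative caption N below both positive captions."""
    negative = cosine(m, n)
    return cosine(m, p1) > negative and cosine(m, p2) > negative


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of triplets judged correct."""
    flags = list(flags)
    return sum(flags) / len(flags)


# Toy embeddings (hypothetical): N sits close to P1, mimicking lexical overlap.
p1 = [1.0, 0.0, 0.0]
p2 = [0.8, 0.6, 0.0]
n = [0.9, 0.0, 0.4]
m = [0.9, 0.3, 0.0]

print(text_triplet_correct(p1, p2, n))      # False: (P1, N) outscores (P1, P2)
print(image_triplet_correct(m, p1, p2, n))  # True: N ranks below both positives
```

In this toy case the text check fails because the negative lies closer to $P_1$ than $P_2$ does, which is the failure mode the benchmark is designed to expose, while the image check passes.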