Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

TL;DR

This work introduces multisense consistency, a paradigm that evaluates LLM understanding by testing whether model-generated senses (here, translations) yield consistent task performance across languages. By applying this to PAWS-X and XNLI in English, German, and Chinese, the study demonstrates substantial language-dependent fluctuations in task interpretation and execution, even in a state-of-the-art system. The approach remains data-contamination-free and easily extensible, offering a practical complement to traditional benchmarks and a path toward more robust, sense-aware evaluation in multilingual settings. The findings highlight the need to consider form and meaning separately when benchmarking LLMs and suggest directions for improving cross-language task understanding.

Abstract

At the staggering pace with which the capabilities of large language models (LLMs) are increasing, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel paradigm for evaluating LLMs which leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test where the different senses are different languages, hence using multilingual self-consistency as a litmus test for the model's understanding and simultaneously addressing the important topic of multilinguality. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency for two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. As our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to different languages and tasks and could become an integral part of future benchmarking efforts.
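The core idea in the abstract, comparing the model's answers on the original input with its answers on a sense it generated itself, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that consistency is measured as a simple agreement rate and that answers have already been normalised to a shared label set; this section does not spell out the exact metric. The names `consistency`, `multisense_consistency`, `solve`, and `make_sense` are hypothetical, and the toy functions stand in for real model calls.

```python
from typing import Callable, Sequence


def consistency(answers_original: Sequence[str],
                answers_alternative: Sequence[str]) -> float:
    """Fraction of examples on which the two answer lists agree.

    Assumption: answers are already mapped to a shared label set,
    so that string equality means "same answer".
    """
    if len(answers_original) != len(answers_alternative):
        raise ValueError("answer lists must have the same length")
    if not answers_original:
        raise ValueError("need at least one example")
    matches = sum(a == b for a, b in zip(answers_original, answers_alternative))
    return matches / len(answers_original)


def multisense_consistency(inputs: Sequence[str],
                           solve: Callable[[str], str],
                           make_sense: Callable[[str], str]) -> float:
    """Solve each input in its original form and in a model-generated
    alternative sense (e.g. a translation), then compare the answers.

    `solve` and `make_sense` are hypothetical wrappers around the same
    model; correctness of the answers is never consulted.
    """
    original = [solve(x) for x in inputs]
    alternative = [solve(make_sense(x)) for x in inputs]
    return consistency(original, alternative)


# Toy stand-ins for model calls (purely illustrative).
def toy_translate(text: str) -> str:
    return text.upper()  # pretend this is "another sense" of the input


def toy_solve(text: str) -> str:
    # Deliberately form-sensitive: only recognises lowercase "same".
    return "paraphrase" if "same" in text else "different"


if __name__ == "__main__":
    examples = ["same meaning", "other meaning"]
    print(multisense_consistency(examples, toy_solve, toy_translate))
```

The toy solver is deliberately sensitive to surface form, so its answer on the first example flips once the input is "translated". That is the kind of form-dependent behaviour the paradigm is designed to expose, and it can be detected without any gold labels.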

Paper Structure (35 sections, 3 figures, 12 tables)

Figures (3)

  • Figure 1: Illustration of the basic mechanism of our paradigm: We use the model to generate other senses of the original input. The model's answers on the original input and the alternative sense are used to evaluate its consistency. In this example, the model is presented with the task of paraphrase detection in English (sentences taken from PAWS-X) and generates another sense by translating from English to German.
  • Figure 2: An overview of our experiments and analyses.
  • Figure 3: Our experiments assess cross-lingual generalisation on natural corpora in pretrained LLMs, as a way to evaluate LLM task understanding.