Quantifying and extending the coverage of spatial categorization data sets

Wanchun Li, Alexandra Carstensen, Yang Xu, Terry Regier, Charles Kemp

TL;DR

Labels generated by large language models (LLMs) align relatively well with human labels, and LLM-generated labels can help decide which scenes and languages to add to existing spatial data sets.

Abstract

Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.

Paper Structure

This paper contains 7 sections, 1 equation, 4 figures, and 1 table.

Figures (4)

  • Figure 1: (a) Spatial relations stimulus sets. Each stimulus shows the relation between a focal object (shown in gold, or marked with an arrow) and a background object. Some of the original images have been cropped for this figure. (i) The Topological Relations Picture Series (TRPS; Bowerman and Pederson, 1992) includes 71 scenes that were chosen to explore the boundaries of terms for on and in relations. (ii) The Zhang set [zhang13] includes 63 scenes illustrating configurations that are absent from the TRPS but relevant to the expression of on and in in Chinese. The examples here include "girls in line," "shadow on wall," and "food on plate." (iii) The LJSP set [landaujsp17] includes 44 scenes that were designed to span six subtypes of containment and five subtypes of support. The examples here include "paper in box," "cup in sleeve," and "cup on counter." (iv) The LCXRK set developed by us includes 42 images designed to illustrate spatial terms in English and Chinese that are not represented by the TRPS, and to include negations and reversals of TRPS scenes. The examples here include "cat among flowers." (b) Extending a cross-linguistic data set using LLMs. (i) The original human labels are organized into a dark gray matrix of languages (rows) by scenes (columns). (ii) We add candidate languages (new rows) and candidate scenes (new columns) and generate LLM labels for all new combinations of languages and scenes. (iii) A coverage measure based on LLM labels is used to prioritize languages and scenes for subsequent human labeling.
  • Figure 2: Evaluation of LLM labels against data collected by (a) [carstensenkhr19] and (b) [xuk10]. The binary score for an image label is 1 if a human provided the same label, and the graded score is the proportion of humans who provided the label. In (a), black points show the maximum possible graded score for each language. In (b), black points show the average result when a single human from the [carstensenkhr19] data is scored with respect to a different speaker of the same language, and error bars show the 95% interquantile range.
  • Figure 3: MDS visualization of a space that includes scenes from all stimulus sets considered in this paper. The four panels show the coverage of this space achieved by (a) the TRPS, (b) the Zhang set, (c) the LJSP set, and (d) our new stimulus set. Points labeled in panels (a) and (d) correspond to images shown in Figures 1a.i and 1a.iv.
  • Figure 4: MDS visualization of the similarity between spatial systems from the languages represented in Figure 2. Languages appearing in the [carstensenkhr19] data set are shown in red.
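
The binary and graded scores described in the Figure 2 caption can be illustrated with a minimal Python sketch. The function names and the toy labels below are hypothetical and are not taken from the paper. The reading of the "maximum possible graded score" as the share of humans who gave the most common label for a scene is an assumption; the caption does not spell out how that ceiling is computed.

```python
from collections import Counter


def binary_score(llm_label, human_labels):
    """Return 1 if at least one human gave the same label, else 0."""
    return int(llm_label in human_labels)


def graded_score(llm_label, human_labels):
    """Return the proportion of humans who gave the same label."""
    if not human_labels:
        raise ValueError("need at least one human label")
    return human_labels.count(llm_label) / len(human_labels)


def max_graded_score(human_labels):
    """Ceiling on the graded score for one scene (assumed definition):
    the share of humans who gave the most common label."""
    if not human_labels:
        raise ValueError("need at least one human label")
    top_count = Counter(human_labels).most_common(1)[0][1]
    return top_count / len(human_labels)


# Hypothetical labels from four speakers for a single scene.
humans = ["on", "on", "on", "over"]

print(binary_score("over", humans))   # 1: one speaker said "over"
print(graded_score("over", humans))   # 0.25
print(graded_score("on", humans))     # 0.75
print(max_graded_score(humans))       # 0.75
```

Averaging these per-scene scores over all scenes in a language would give one value per language, which is the granularity at which Figure 2 appears to report results.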