Individuation in Neural Models with and without Visual Grounding

Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov

TL;DR

CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data, and the individuation hierarchy deduced from these embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

Abstract

We show differences between the language-and-vision model CLIP and two text-only models, FastText and SBERT, in how they encode individuation information. We study the latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

Paper Structure (10 sections, 2 equations, 2 figures, 3 tables)

This paper contains 10 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Side-by-side comparison of the models' ability to contrast various numbers of objects. The heat map represents the average distances between pairs of embeddings that each model provides for various quantities of the same objects. The results are averaged across all objects and normalized.
  • Figure 2: P-values for the individuation capabilities of CLIP in comparison with SBERT and FastText, based on the proxy metric for individuation. Classes with $p>5\%$ are not significantly distinguishable. The order of rows follows the average value of the proposed individuation proxy: the less individuated classes are at the top, the more individuated ones at the bottom. The order of columns repeats the order of rows, making every matrix symmetric.
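
The kind of heat map described in the Figure 1 caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the cosine distance, the min-max normalization, the function name `quantity_distance_matrix`, and the random placeholder embeddings are all assumptions, since the caption does not specify the distance measure or the normalization scheme.

```python
import numpy as np


def quantity_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Average pairwise distances between quantity embeddings.

    embeddings has shape (n_objects, n_quantities, dim): one vector per
    (object, quantity) prompt, e.g. "two apples", "three apples".
    Returns an (n_quantities, n_quantities) matrix scaled to [0, 1].
    """
    # Assumption: cosine distance; the caption only says "distances".
    norms = np.linalg.norm(embeddings, axis=-1, keepdims=True)
    unit = embeddings / norms

    # Cosine similarity between every pair of quantities, per object.
    sims = np.einsum("oqd,ord->oqr", unit, unit)
    dists = 1.0 - sims

    # Average across all objects, as the caption describes.
    mean = dists.mean(axis=0)

    # Assumption: min-max normalization; the caption only says "normalized".
    lo, hi = mean.min(), mean.max()
    if hi == lo:
        return np.zeros_like(mean)
    return (mean - lo) / (hi - lo)


# Placeholder data standing in for real model embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 4, 16))  # 5 objects, 4 quantities
heat = quantity_distance_matrix(fake_embeddings)
```

With real data, `fake_embeddings` would be replaced by the vectors each model (CLIP, SBERT, FastText) produces for the same set of quantity prompts, giving one matrix per model to compare side by side.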