Disentanglement and Compositionality of Letter Identity and Letter Position in Variational Auto-Encoder Vision Models

Bruno Bianchi, Aakash Agrawal, Stanislas Dehaene, Emmanuel Chemla, Yair Lakretz

TL;DR

This work introduces CompOrth, a targeted benchmark for assessing whether visual models can disentangle letter identity from letter position to support compositional word processing. Using $\beta$-VAE variants trained on images of short letter strings, the study evaluates both behavioral generalization across spatial, length, and compositional factors and the disentanglement of latent representations, measured with the Mutual Information Ratio (MIR) and with perturbations of latent units. The results show robust generalization to unseen retinal positions but systematic failures in generalizing to longer word lengths and to unseen letter-position compositions, with only weak evidence for disentangled encoding of identity and position. Overall, CompOrth reveals a gap between human-like orthographic compositionality and current disentanglement-focused models, motivating new architectures and benchmarks for visual language processing.

Abstract

Human readers can accurately count how many letters are in a word (e.g., 7 in "buffalo"), remove a letter from a given position (e.g., "bufflo"), or add a new one. The brains of readers must therefore have learned to disentangle information about the position of a letter from information about its identity. Such disentanglement is necessary for the compositional, unbounded ability of humans to create and parse new strings, with any combination of letters appearing in any position. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art performance on disentangling features of visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained beta variational autoencoders ($\beta$-VAEs) to reconstruct images of letter strings and evaluated their disentanglement performance using CompOrth, a new benchmark that we created for studying compositional learning and zero-shot generalization in visual models for orthography. The benchmark proposes a set of tests of increasing complexity that evaluate the degree of disentanglement between orthographic features of written words in deep neural models. Using CompOrth, we conducted a set of experiments to analyze the generalization ability of these models, in particular to unseen word lengths and to unseen combinations of letter identities and letter positions. We found that while models effectively disentangle surface features, such as the horizontal and vertical 'retinal' locations of words within an image, they dramatically fail to disentangle letter position and letter identity and lack any notion of word length. Together, this study demonstrates the shortcomings of state-of-the-art $\beta$-VAE models compared to humans and proposes a new challenge and a corresponding benchmark for evaluating neural models.
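
For readers unfamiliar with the training objective, the following is a minimal sketch of the standard $\beta$-VAE loss: a reconstruction term plus a KL term weighted by $\beta$. The function name `beta_vae_loss`, the squared-error reconstruction term, and the default `beta=4.0` are assumptions made for illustration; the abstract does not specify the exact likelihood or implementation used by the authors.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Batch-averaged beta-VAE objective: reconstruction error + beta * KL.

    x, x_recon : arrays of shape (batch, n_pixels)
    mu, logvar : arrays of shape (batch, latent_size), the parameters of the
                 diagonal Gaussian posterior q(z | x)
    """
    # Reconstruction term: summed squared pixel error per image.
    # (Assumption: the paper may use a different likelihood, e.g. Bernoulli.)
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) per image.
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return float(np.mean(recon + beta * kl))

# A perfect reconstruction with a posterior equal to the prior costs nothing.
x = np.zeros((2, 4))
zero_loss = beta_vae_loss(x, x, np.zeros((2, 3)), np.zeros((2, 3)))

# Shifting every latent mean to 1 adds a KL of 0.5 per latent dimension.
shifted_loss = beta_vae_loss(x, x, np.ones((2, 3)), np.zeros((2, 3)), beta=4.0)
```

Larger values of `beta` penalize deviations of the posterior from the isotropic prior more heavily, which is the pressure that is generally expected to encourage disentangled latent units, at some cost in reconstruction quality.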

Paper Structure

This paper contains 22 sections, 9 figures.

Figures (9)

  • Figure 1: (A) The CompOrth Benchmark: schemes of the three types of generalization tests. (B) An illustration of the architecture of an autoencoder for processing images of written words.
  • Figure 2: Model Selection. (A) Reconstruction loss against Mutual Information Ratio. Each dot represents a model. Orange dots represent the models from which the examples in the right panels were sampled. Models marked with purple circles lie on the Pareto front. (B) Reconstruction examples from the model with the best reconstruction loss and MIG ($\beta$=4, Latent Size=32, Learning Rate=0.0001). (C) Reconstruction examples from a model with relatively poor reconstruction loss ($\beta$=2, Latent Size=128, Learning Rate=0.0001).
  • Figure 3: Results on CompOrth tests. Average reconstruction loss on (A) the Retinal-Position Test, (B) the Word-Length Test, and (C) the Abstract-Position Test. Dashed lines mark the chance level for the classifier (chance $= 1/62$). On the right of each panel, several examples show how the model reconstructs test images. Note the red markings on the images, which highlight the types of errors the model makes.
  • Figure 4: Neural Perturbation Analyses. (A) Example of a hypothetical encoding scheme: single neurons for positions, where the degree of activation indicates which letter is present in that position. (B-D) Perturbation results for example units from a model with strong performance on CompOrth ($\beta=4$, latent-size$=32$). Each row represents a different sample (word image); columns represent different levels of perturbation.
  • Figure A.1: Examples from the generated dataset. Each image consists of a string of 1 to 5 letters, using only the uppercase characters A and B. To generate variations of these strings, the letter spacing and the x and y positions were modified.
  • ...and 4 more figures
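
The string vocabulary described in the Figure A.1 caption (strings of 1 to 5 letters over the characters A and B) can be enumerated directly. The sketch below is illustrative only: the names `ALPHABET`, `MAX_LEN`, and `all_strings` are hypothetical and do not come from the authors' code. The resulting count of 62 strings coincides with the $1/62$ chance level quoted in the Figure 3 caption, although the text shown here does not state that the chance level was derived this way.

```python
from itertools import product

ALPHABET = "AB"   # characters named in the Figure A.1 caption
MAX_LEN = 5       # maximum string length named in the Figure A.1 caption

def all_strings(alphabet=ALPHABET, max_len=MAX_LEN):
    """Return every string of length 1..max_len over the given alphabet."""
    return [
        "".join(letters)
        for length in range(1, max_len + 1)
        for letters in product(alphabet, repeat=length)
    ]

strings = all_strings()

# Number of distinct strings per length: 2, 4, 8, 16, 32.
counts_by_length = {
    n: sum(1 for s in strings if len(s) == n) for n in range(1, MAX_LEN + 1)
}

print(len(strings))  # 62
```

Splitting this list by `len(s)` gives the kind of partition a word-length generalization test needs, for example training on the shorter strings and testing on the longer ones; the exact splits used in CompOrth are not specified in the text shown here.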