Neural Language of Thought Models

Yi-Fu Wu; Minseung Lee; Sungjin Ahn

TL;DR

This work tackles unsupervised learning of a neural language of thought. It proposes NLoTM, which combines Semantic Vector Quantization (SVQ), producing hierarchical, discrete, object-centric factors, with an Autoregressive LoT Prior (ALP) that generates semantic tokens compositionally and probabilistically. By operating over factorized latent codes rather than patch tokens, NLoTM improves generation quality, downstream task performance, and out-of-distribution generalization on 2D Sprites and 3D CLEVR-like datasets. The key contributions are the SVQ module, which disentangles object factors into block-level codes; the ALP, which models the data distribution autoregressively over these codes; and experiments showing superior performance against patch-based VQ-VAE, dVAE, GENESIS-v2, and SysBinder baselines. The work connects cognitively inspired representations with machine learning, offering a path toward more human-like understanding through structured, discrete, semantically meaningful representations.

Abstract

The Language of Thought Hypothesis suggests that human cognition operates on a structured, language-like system of mental representations. While neural language models can naturally benefit from the compositional structure inherently and explicitly expressed in language data, learning such representations from non-linguistic general observations, like images, remains a challenge. In this work, we introduce the Neural Language of Thought Model (NLoTM), a novel approach for unsupervised learning of LoTH-inspired representation and generation. NLoTM comprises two key components: (1) the Semantic Vector-Quantized Variational Autoencoder, which learns hierarchical, composable discrete representations aligned with objects and their properties, and (2) the Autoregressive LoT Prior, an autoregressive transformer that learns to generate semantic concept tokens compositionally, capturing the underlying data distribution. We evaluate NLoTM on several 2D and 3D image datasets, demonstrating superior performance in downstream tasks, out-of-distribution generalization, and image generation quality compared to patch-based VQ-VAE and continuous object-centric representations. Our work presents a significant step towards creating neural networks exhibiting more human-like understanding by developing LoT-like representations and offers insights into the intersection of cognitive science and machine learning.

Paper Structure (30 sections, 2 equations, 10 figures, 10 tables)

Introduction
Background
Vector-Quantized Variational Autoencoder (VQ-VAE)
Object-Centric Representations
Neural Language of Thought Model
Semantic Vector Quantization
Autoregressive Language of Thought Prior
Related Work
Experiments
Generating Samples with the Autoregressive LoT Prior
2D Sprites
CLEVR
Downstream Tasks
Odd-One-Out
CLEVR-Hard Property Comparison
...and 15 more sections

Figures (10)

Figure 1: Comparison between VQ-VAE, Quantized Slots, and SVQ. (a) VQ-VAE quantizes the scene at a local patch level and may not capture the semantic structure of the scene. (b) Quantized Slots (QS) would quantize the scene at the slot level but require a separate code for every possible configuration of an object. (c) SVQ quantizes at the block level, representing each factor (such as color or shape) as a code. In this example, to represent all possible object configurations, SVQ requires only 10 codebook entries at the block level while QS requires 25.
Figure 2: Overall architecture of NLoTM. (a) The Semantic Vector-Quantized (SVQ) Variational Autoencoder. We maintain $M$ learned codebooks and split each slot into $M$ blocks. After each Slot Attention iteration, we apply vector quantization to each block representation to obtain a set of discrete codes for each slot. Each block ends up specializing to different underlying factors of variation for the objects in the scene. (b) The Autoregressive LoT Prior (ALP). We train an autoregressive prior over the discrete latent codes from SVQ. Sampling from this prior allows us to generate an image one object at a time, based on their properties.
Figure 3: Generated samples for the 4-object 2D Sprites and 4-object 2D Sprites with background datasets.
Figure 4: Generated samples for the CLEVR-Easy, CLEVR-Hard, and CLEVR-Tex Datasets.
Figure 5: Sample scene we use in our codebook analysis.
...and 5 more figures
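
The block-level quantization described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes equal-sized blocks, nearest-neighbour lookup under squared Euclidean distance (as in a standard VQ-VAE), and a toy setup with M = 2 factors of 5 codes each, chosen only because it reproduces the 10-versus-25 count quoted for Figure 1. The names `nearest_code`, `quantize_slot`, and the "color"/"shape" codebooks are hypothetical.

```python
from itertools import product


def nearest_code(block, codebook):
    """Return the index of the codebook entry closest to `block`
    under squared Euclidean distance."""
    def sq_dist(entry):
        return sum((b - e) ** 2 for b, e in zip(block, entry))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def quantize_slot(slot, codebooks):
    """Split `slot` into M equal blocks (M = number of codebooks) and
    quantize block m with codebook m, returning one code index per block."""
    num_blocks = len(codebooks)
    if len(slot) % num_blocks != 0:
        raise ValueError("slot length must be divisible by the number of codebooks")
    block_size = len(slot) // num_blocks
    blocks = [slot[m * block_size:(m + 1) * block_size] for m in range(num_blocks)]
    return [nearest_code(block, codebooks[m]) for m, block in enumerate(blocks)]


# Toy setup (assumed, not from the paper): 2 factors, 5 codes each, 2-d blocks.
codebooks = [
    [[float(k), 0.0] for k in range(5)],  # hypothetical "color" codebook
    [[0.0, float(k)] for k in range(5)],  # hypothetical "shape" codebook
]

# One 4-dimensional slot vector, i.e. two 2-dimensional blocks.
slot = [2.9, 0.1, 0.2, 1.2]
codes = quantize_slot(slot, codebooks)

# Block-level quantization stores one entry per factor value (sum of sizes);
# slot-level quantization needs one entry per combination (product of sizes).
block_level_entries = sum(len(cb) for cb in codebooks)
slot_level_entries = len(list(product(*codebooks)))

print(codes, block_level_entries, slot_level_entries)
```

Under these assumptions the number of block-level entries grows as the sum of the per-factor codebook sizes, while a slot-level codebook grows as their product, which is the gap the Figure 1 caption points to.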
