Neural Language of Thought Models
Yi-Fu Wu, Minseung Lee, Sungjin Ahn
TL;DR
This work addresses unsupervised learning of a neural language of thought. It proposes NLoTM, which combines the Semantic Vector-Quantized Variational Autoencoder (SVQ), which produces hierarchical, discrete, object-centric factors, with the Autoregressive LoT Prior (ALP), which generates semantic tokens compositionally and probabilistically. Because it operates over factorized latent codes rather than patch tokens, NLoTM improves generation quality, downstream task performance, and out-of-distribution generalization on 2D Sprite and 3D CLEVR-like datasets. The key contributions are the SVQ module, which disentangles object factors into block-level codes; the ALP prior, which models the data distribution autoregressively over these codes; and experiments reporting better performance than patch-based VQ-VAE, dVAE, GENESIS-v2, and SysBinder baselines. The work connects cognitively inspired representations with machine learning, offering a path toward more human-like understanding through structured, discrete, semantically meaningful representations.
Abstract
The Language of Thought Hypothesis (LoTH) suggests that human cognition operates on a structured, language-like system of mental representations. While neural language models can naturally benefit from the compositional structure inherently and explicitly expressed in language data, learning such representations from non-linguistic general observations, like images, remains a challenge. In this work, we introduce the Neural Language of Thought Model (NLoTM), a novel approach for unsupervised learning of LoTH-inspired representation and generation. NLoTM comprises two key components: (1) the Semantic Vector-Quantized Variational Autoencoder (SVQ), which learns hierarchical, composable discrete representations aligned with objects and their properties, and (2) the Autoregressive LoT Prior (ALP), an autoregressive transformer that learns to generate semantic concept tokens compositionally, capturing the underlying data distribution. We evaluate NLoTM on several 2D and 3D image datasets, demonstrating superior performance in downstream tasks, out-of-distribution generalization, and image generation quality compared to patch-based VQ-VAE and continuous object-centric representations. Our work presents a significant step towards creating neural networks exhibiting more human-like understanding by developing LoT-like representations and offers insights into the intersection of cognitive science and machine learning.
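To make the idea of discrete, object-level factor codes concrete, the following is a minimal NumPy sketch of block-wise vector quantization. It is an illustration under stated assumptions, not the authors' implementation: it assumes each object slot is already split into a fixed number of blocks, that each block position has its own codebook, and that codes are chosen by nearest neighbor in squared Euclidean distance. The names `quantize_blocks`, `slots`, and `codebooks` are hypothetical and do not come from the paper.

```python
import numpy as np

def quantize_blocks(slots, codebooks):
    """Assign each block of each slot to its nearest codebook entry.

    slots:     (num_slots, num_blocks, block_dim) continuous block vectors
    codebooks: (num_blocks, codebook_size, block_dim), one codebook per block
               position (an assumption made for this sketch)
    Returns integer codes (num_slots, num_blocks) and the quantized vectors
    (num_slots, num_blocks, block_dim).
    """
    # diff[s, b, k, :] = slots[s, b, :] - codebooks[b, k, :]
    diff = slots[:, :, None, :] - codebooks[None, :, :, :]
    dist = np.sum(diff ** 2, axis=-1)          # (num_slots, num_blocks, codebook_size)
    codes = np.argmin(dist, axis=-1)           # (num_slots, num_blocks)
    # Look up the chosen entry in the codebook that belongs to each block.
    block_ids = np.arange(codebooks.shape[0])[None, :]
    quantized = codebooks[block_ids, codes]    # (num_slots, num_blocks, block_dim)
    return codes, quantized

# Toy example: one slot, two blocks, three codebook entries per block.
codebooks = np.stack([np.eye(3), 2.0 * np.eye(3)])
slots = np.array([[[0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.8]]])
codes, quantized = quantize_blocks(slots, codebooks)

# Flattening the codes gives one token sequence per scene, which is the kind
# of input an autoregressive prior over codes could be trained on.
tokens = codes.reshape(-1)
```

In this sketch each slot becomes a short tuple of integers, one per block, instead of a grid of patch tokens. How the blocks are produced, how the codebooks are trained, and in what order tokens are fed to the prior are left open here.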
