Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
Yuwei Bao, Barrett Martin Lattimer, Joyce Chai
TL;DR
The paper tackles grounded word acquisition by modeling it as simultaneous information filtration and representation-symbol mapping, inspired by human learning principles. It introduces SOLA, a clean multimodal dataset, and a comparative learning framework that separates similarity and difference training to form word-grounded representations, later aligned to CLIP space via a Generative Decoder Learning stage. Empirical results in multi-attribute recognition, continual learning, and compositional reasoning show strong performance against baselines, resilience to forgetting, and notable zero-shot compositional capabilities. This approach advances continual grounded language acquisition and offers a pathway toward scalable, modular word understanding in multimodal AI systems. The work also discusses limitations and future directions, including scaling, memory management, and cross-modal sentence grounding.
Abstract
Human language acquisition is an efficient, supervised, and continual process. In this work, we took inspiration from how human babies acquire their first language, and developed a computational process for word acquisition through comparative learning. Motivated by cognitive findings, we generated a small dataset that enables computational models to compare the similarities and differences of various attributes, and to learn to filter out and extract the common information for each shared linguistic label. We frame the acquisition of words not only as an information filtration process, but also as a representation-symbol mapping. This procedure involves neither a fixed vocabulary size nor a discriminative objective, and allows the models to continually learn more concepts efficiently. Our results in controlled experiments show the potential of this approach for efficient continual learning of grounded words.
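
The idea of pairing information filtration with representation-symbol mapping can be illustrated with a toy sketch. This is not the paper's implementation: the hand-made three-dimensional features, the variance and gap thresholds, and the names `learn_word`, `distance`, `recognize`, and `memory` are all assumptions made for illustration. The sketch compares a batch of examples that share a word with a batch that does not, keeps only the feature dimensions that are consistent within the first batch and distinct from the second (filtration), and stores the filtered centroid under the word (mapping). Because the memory is a plain dictionary, new words can be added without a fixed vocabulary or a classifier head.

```python
from statistics import mean, pvariance

def learn_word(similar, different, var_threshold=0.05, gap_threshold=0.3):
    """Toy comparison step: return (mask, centroid) for one word.

    similar   -- feature vectors that all share the word
    different -- feature vectors that do not carry the word
    Thresholds are illustrative assumptions, not values from the paper.
    """
    dims = range(len(similar[0]))
    mask = []
    for d in dims:
        sim_vals = [x[d] for x in similar]
        diff_vals = [x[d] for x in different]
        # Keep a dimension only if the similar batch agrees on it
        # and it separates the similar batch from the different batch.
        consistent = pvariance(sim_vals) < var_threshold
        distinctive = abs(mean(sim_vals) - mean(diff_vals)) > gap_threshold
        mask.append(1.0 if consistent and distinctive else 0.0)
    centroid = [m * mean(x[d] for x in similar) for d, m in zip(dims, mask)]
    return mask, centroid

def distance(x, mask, centroid):
    """Euclidean distance between the filtered input and a word centroid."""
    return sum((m * xi - c) ** 2 for xi, m, c in zip(x, mask, centroid)) ** 0.5

def recognize(x, memory, max_dist=0.2):
    """Return every known word whose filtered centroid is close to x."""
    return sorted(w for w, (mask, c) in memory.items()
                  if distance(x, mask, c) < max_dist)

memory = {}  # word -> (mask, centroid); grows as new words are acquired

# Hypothetical features: [redness, roundness, size]
red = [[0.9, 0.8, 0.2], [1.0, 0.1, 0.9], [0.95, 0.5, 0.5]]
not_red = [[0.1, 0.7, 0.3], [0.0, 0.2, 0.8], [0.1, 0.5, 0.5]]
memory["red"] = learn_word(red, not_red)
red_entry = memory["red"]

# Acquiring a second word later leaves the first entry untouched.
round_ = [[0.9, 0.9, 0.2], [0.1, 1.0, 0.9], [0.5, 0.95, 0.5]]
not_round = [[0.9, 0.1, 0.2], [0.1, 0.0, 0.9], [0.5, 0.05, 0.5]]
memory["round"] = learn_word(round_, not_round)

print(recognize([0.92, 0.97, 0.4], memory))
print(recognize([0.05, 0.90, 0.4], memory))
```

In this toy setting each word ends up attending to a single feature dimension, so an input can match several words at once, which loosely mirrors multi-attribute recognition. The paper's models learn such filters and representations from data rather than from fixed thresholds.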
