From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning
Haodong Xie, Rahul Singh Maharjan, Federico Tavella, Angelo Cangelosi
TL;DR
This work tackles the challenge of learning high-order abstract concepts by proposing a developmental, multimodal framework that grounds subordinate concrete concepts to form basic and superordinate abstractions across vision and language. It introduces a multimodal variational autoencoder (MMVAE) architecture that combines a visual VAE operating on 2048-dimensional ResNet features with a language VAE built on frozen BERT embeddings. The two are fused through a mixture of experts to learn a joint multimodal distribution and enable cross-modal generation. The model demonstrates cross-modal capabilities in language-to-vision and vision-to-language tasks, achieving strong accuracy at the subordinate and basic levels and strong CLIP alignment on a relatively small dataset, while ablation studies show that it scales to larger concept sets. This approach offers a development-inspired pathway for abstract concept learning in AI and points to future extensions to more abstract levels and to developmental linguistic data.
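The mixture-of-experts fusion mentioned above can be illustrated with a minimal sketch. It assumes equal mixture weights and diagonal Gaussian unimodal posteriors, neither of which is stated in this summary, and the encoder outputs, the 4-dimensional latent, and all function names are hypothetical stand-ins rather than the authors' implementation.

```python
import math
import random

def gaussian_log_density(z, mu, logvar):
    """Log density of a diagonal Gaussian N(mu, exp(logvar)) evaluated at z."""
    return sum(
        -0.5 * (math.log(2 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mu, logvar)
    )

def moe_sample(experts, rng):
    """Draw z from an equally weighted mixture of unimodal posteriors.

    experts: list of (mu, logvar) pairs, one per modality (assumed format).
    """
    mu, logvar = rng.choice(experts)  # pick one modality's posterior uniformly
    # Reparameterised draw: z = mu + sigma * eps, with eps ~ N(0, 1)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def moe_log_density(z, experts):
    """log q(z | x_vision, x_language) = log((1/M) * sum_m q(z | x_m)), via log-sum-exp."""
    logs = [gaussian_log_density(z, mu, lv) for mu, lv in experts]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs)) - math.log(len(experts))

# Hypothetical encoder outputs for one image/label pair (4-d latent for brevity).
vision_expert = ([0.2, -0.1, 0.0, 0.5], [-1.0, -1.0, -1.0, -1.0])
language_expert = ([0.3, 0.0, 0.1, 0.4], [-2.0, -2.0, -2.0, -2.0])

rng = random.Random(0)
z = moe_sample([vision_expert, language_expert], rng)
log_q = moe_log_density(z, [vision_expert, language_expert])
```

Under this kind of fusion, cross-modal generation amounts to sampling `z` from one modality's posterior and passing it to the other modality's decoder; the decoders are omitted here.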
Abstract
Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence, yet it remains challenging for artificial agents. This paper introduces a multimodal generative approach to high-order abstract concept learning that integrates visual and categorical linguistic information from concrete concepts. Our model first grounds subordinate-level concrete concepts, combines them to form basic-level concepts, and finally abstracts to superordinate-level concepts by grounding the basic-level concepts. We evaluate the model's language learning ability through language-to-visual and visual-to-language tests with high-order abstract concepts. Experimental results demonstrate the model's proficiency in both language understanding and language naming tasks.
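The three concept levels described above can be pictured as a small label hierarchy. The sketch below is only an illustration of that structure: the concept names and helper functions are hypothetical and are not taken from the paper's dataset.

```python
# Hypothetical three-level taxonomy; the concept names are illustrative only.
SUBORDINATE_TO_BASIC = {
    "golden retriever": "dog",
    "poodle": "dog",
    "tabby": "cat",
    "sports car": "car",
}
BASIC_TO_SUPERORDINATE = {
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

def labels_for(subordinate):
    """Return the (subordinate, basic, superordinate) labels for one concept."""
    basic = SUBORDINATE_TO_BASIC[subordinate]
    return subordinate, basic, BASIC_TO_SUPERORDINATE[basic]

def members_of(superordinate):
    """List the subordinate concepts grouped under a superordinate concept."""
    return sorted(
        sub for sub, basic in SUBORDINATE_TO_BASIC.items()
        if BASIC_TO_SUPERORDINATE[basic] == superordinate
    )
```

In such a setup, each image would carry one label per level, so the same visual features can be paired with increasingly abstract language.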
