From Concrete to Abstract: A Multimodal Generative Approach to Abstract Concept Learning

Haodong Xie, Rahul Singh Maharjan, Federico Tavella, Angelo Cangelosi

TL;DR

This work tackles the challenge of learning high-order abstract concepts by proposing a developmental, multimodal framework that grounds subordinate concrete concepts to form basic and superordinate abstractions across vision and language. It introduces a multimodal variational autoencoder (MMVAE) architecture that combines a visual VAE operating on 2048‑dimensional ResNet features with a language VAE built on frozen BERT embeddings, fused through a mixture-of-experts to learn a joint multimodal distribution and enable cross-modal generation. The model demonstrates cross-modal capabilities in language-to-vision and vision-to-language tasks, achieving strong accuracies at subordinate and basic levels and CLIP alignment on a relatively small dataset, while ablation studies show scalability to larger concept sets. This approach offers a development‑inspired pathway for abstract concept learning in AI and points to future extensions to more abstract levels and developmental linguistic data.

Abstract

Understanding and manipulating concrete and abstract concepts is fundamental to human intelligence, yet both remain challenging for artificial agents. This paper introduces a multimodal generative approach to high-order abstract concept learning that integrates visual and categorical linguistic information from concrete concepts. Our model first grounds subordinate-level concrete concepts, combines them to form basic-level concepts, and finally abstracts to superordinate-level concepts via the grounding of basic-level concepts. We evaluate the model's language learning ability through language-to-visual and visual-to-language tests with high-order abstract concepts. Experimental results demonstrate the proficiency of the model in both language understanding and language naming tasks.

Paper Structure (14 sections, 4 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: The Joint Modalities Learning and Cross Modality Test. During training, the model learns the concepts jointly with both visual and linguistic information. During testing, when one modality is missing, the model generates the missing modality based on the input modality.
  • Figure 2: The overall structure of our model. Similar to the MVAE proposed by Wu et al., which treats each attribute as its own modality, we treat each level of concepts as its own modality and apply two separate language VAEs to the label input. The MoE network generates a joint distribution representing the information from these four modalities. The decoders sample from the latent space and then regenerate the image and the two levels of concepts.
  • Figure 3: Results for Language Understanding (Language-to-Vision). The left column displays the images generated from the subordinate-level concepts, while the right column shows the images generated from the basic-level concepts.
  • Figure 4: Results for Language Naming (Vision-to-Language). The language modality is treated as the missing modality, and the subordinate-level and basic-level labels are generated from the image.
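
The mixture-of-experts fusion mentioned in the Figure 2 caption, and the missing-modality tests of Figures 3 and 4, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each modality encoder outputs a diagonal Gaussian over a shared latent space, that the joint posterior is a uniform mixture of these experts, and that a missing modality is simply left out of the mixture. The function name `moe_sample`, the `present` mask, and the toy dimensions are hypothetical.

```python
import numpy as np


def moe_sample(mus, logvars, rng, present=None):
    """Draw one latent sample from a uniform mixture of Gaussian experts.

    mus, logvars: arrays of shape (M, D), one diagonal Gaussian per modality
                  (hypothetical encoder outputs, not the paper's actual tensors).
    present:      optional boolean mask of shape (M,); False marks a missing
                  modality, which is excluded from the mixture.
    Returns the sample z of shape (D,) and the index of the chosen expert.
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    n_experts, latent_dim = mus.shape

    if present is None:
        present = np.ones(n_experts, dtype=bool)
    available = np.flatnonzero(np.asarray(present, dtype=bool))
    if available.size == 0:
        raise ValueError("at least one modality must be present")

    # Uniform mixture: pick one available expert with equal probability.
    k = int(rng.choice(available))

    # Reparameterised sample from the chosen expert: z = mu + sigma * eps.
    eps = rng.standard_normal(latent_dim)
    z = mus[k] + np.exp(0.5 * logvars[k]) * eps
    return z, k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy example: three experts over a 4-dimensional latent space.
    mus = np.array([[0.0] * 4, [1.0] * 4, [-1.0] * 4])
    logvars = np.zeros((3, 4))
    # Cross-modal case: only expert 1 (e.g. a label encoder) is available.
    z, k = moe_sample(mus, logvars, rng, present=[False, True, False])
    print(k, z.shape)
```

In a cross-modal test under these assumptions, the sample `z` drawn from the available expert would be passed to the decoder of the missing modality, for example the image decoder when only a label is given.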