If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions

Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto

TL;DR

This work tackles teaching pre-trained Vision-Language Models new visual concepts using only textual descriptions, without relying on real images or external generative models. The proposed Knowledge Transfer framework synthesizes target visuals via model inversion and fine-tunes the visual encoder with image–text contrastive learning, enabling robust cross-modal alignment for novel concepts and improved zero-shot downstream tasks. KT demonstrates strong gains across natural and medical domains in classification, segmentation, retrieval, and captioning, while preserving prior capabilities and showing potential for out-of-domain generalization. The approach offers a data-efficient, model-internal path to rapid concept expansion with practical appeal for data-scarce applications like medical imaging and beyond.

Abstract

Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts using only a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be re-used to represent previously unknown concepts. Provided with a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches relying on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including classification, segmentation, image-text retrieval, and captioning, we show that: 1) KT can efficiently introduce new visual concepts from a single textual description; 2) the same principle can be used to refine the representation of existing concepts; and 3) KT significantly improves the performance of zero-shot VLMs.
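
To make the "model inversion" step in the abstract more concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a toy frozen linear "visual encoder" `W`, a fixed text embedding `t`, and plain gradient ascent on cosine similarity; the names `invert_toward_text`, `W`, `t`, and `x0` are hypothetical and chosen only for illustration. A real VLM would use a deep encoder, image-shaped inputs, and regularization that this sketch omits.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_grad_wrt_input(W, x, t):
    """Gradient of cosine(W x, t) with respect to the input x."""
    u = matvec(W, x)
    dot = sum(a * b for a, b in zip(u, t))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_t = math.sqrt(sum(b * b for b in t))
    # Gradient with respect to the embedding u = W x.
    grad_u = [
        tb / (norm_u * norm_t) - dot * ua / (norm_u ** 3 * norm_t)
        for ua, tb in zip(u, t)
    ]
    # Chain rule through the linear encoder: W^T grad_u.
    return [
        sum(W[i][j] * grad_u[i] for i in range(len(W)))
        for j in range(len(x))
    ]


def invert_toward_text(W, t, x0, steps=300, lr=0.1):
    """Optimize the input (not the encoder) so its embedding aligns with t.

    Assumption: the encoder W stays frozen during inversion; only x changes.
    Returns the optimized input and the similarity after each step.
    """
    x = list(x0)
    history = [cosine(matvec(W, x), t)]
    for _ in range(steps):
        grad = cosine_grad_wrt_input(W, x, t)
        x = [xi + lr * gi for xi, gi in zip(x, grad)]  # gradient ascent
        history.append(cosine(matvec(W, x), t))
    return x, history


# Toy setup (all values are illustrative assumptions).
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5]]          # frozen toy "visual encoder": 3-d input -> 2-d embedding
t = [0.0, 1.0]                 # toy "text embedding" of the new concept
x0 = [1.0, 0.0, 0.0]           # starting input, initially orthogonal to t in embedding space

x_inv, history = invert_toward_text(W, t, x0)
print("similarity before:", history[0])
print("similarity after: ", history[-1])
```

In this toy setting the synthesized input `x_inv` plays the role of the inverted visual sample; per the abstract, such samples would then be used to align the visual encoder with the text representation, a step this sketch leaves out.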

Paper Structure

This paper contains 73 sections, 7 equations, 13 figures, 19 tables.

Figures (13)

  • Figure 1: Overview of Knowledge Transfer (KT): one panel (fig:kt-vs-synth) shows our research question, while the other (fig:results-overview) summarizes performance improvements.
  • Figure 2: Knowledge Transfer can introduce novel concepts in a multimodal model by leveraging prior visual knowledge of the visual encoder and a textual description of the target concept. In the example, a CLIP model (Radford et al., 2021) learns the concepts Moongate and Tonometer without using any real image, while retaining good accuracy on general zero-shot classification (58.10% vs 56.43% and 70.79% vs 70.61% on ImageNet-1k).
  • Figure 3: Knowledge Transfer (KT) on novel and rare concepts (high-level and fine-grained concepts). KT improves the target accuracy on the novel concept in all instances, in some cases notably. We also verify that catastrophic forgetting does not occur by monitoring zero-shot accuracy on ImageNet, which remains comparable with the baseline.
  • Figure 4: Qualitative evaluation of KT on breast tumor segmentation (UDIAT dataset). We report illustrative examples where knowledge transfer improved segmentation, as measured by the Dice similarity coefficient (DSC).
  • Figure 5: KT results at different inversion steps. Better inversion quality (more steps) generally leads to improved results.
  • ...and 8 more figures