If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions
Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto
TL;DR
This work tackles teaching pre-trained Vision-Language Models new visual concepts using only textual descriptions, without relying on real images or external generative models. The proposed Knowledge Transfer (KT) framework synthesizes target visuals via model inversion and fine-tunes the visual encoder with image–text contrastive learning, enabling robust cross-modal alignment for novel concepts and improved zero-shot performance on downstream tasks. KT demonstrates strong gains across natural and medical domains in classification, segmentation, retrieval, and captioning, while preserving prior capabilities and showing potential for out-of-domain generalization. The approach offers a data-efficient, model-internal path to rapid concept expansion, with practical appeal for data-scarce applications such as medical imaging.
Abstract
Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts using only a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be re-used to represent previously unknown concepts. Provided with a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches relying on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including classification, segmentation, image-text retrieval, and captioning, we show that: 1) KT can efficiently introduce new visual concepts from a single textual description; 2) the same principle can be used to refine the representation of existing concepts; and 3) KT significantly improves the performance of zero-shot VLMs.
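The two stages named above (model inversion with a frozen model, then fine-tuning the visual encoder toward the concept's text representation) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the linear "visual encoder" `W`, the hand-picked text embeddings `t_desc` (description) and `t_name` (novel concept), the helper names, and the step sizes are all hypothetical. A plain cosine-alignment objective without negative pairs stands in for the image-text contrastive loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def encode(W, x):
    """Toy linear 'visual encoder' (assumption): image vector x -> W @ x."""
    return [dot(row, x) for row in W]

def cosine_grad(z, t):
    """Gradient of cosine(z, t) with respect to z."""
    nz, nt = norm(z), norm(t)
    c = dot(z, t) / (nz * nt)
    return [ti / (nz * nt) - c * zi / (nz * nz) for zi, ti in zip(z, t)]

def invert(W, t, x0, steps=200, lr=0.5):
    """Stage 1, model inversion: W stays frozen and the input x is
    optimized by gradient ascent so that encode(W, x) aligns with t."""
    x = list(x0)
    for _ in range(steps):
        g_z = cosine_grad(encode(W, x), t)
        # Chain rule through z = W x gives grad_x = W^T g_z.
        g_x = [sum(W[k][j] * g_z[k] for k in range(len(W)))
               for j in range(len(x))]
        x = [xj + lr * gj for xj, gj in zip(x, g_x)]
    return x

def finetune(W, x, t, steps=200, lr=0.5):
    """Stage 2, fine-tuning: x and t stay fixed and a copy of W is
    updated so that encode(W, x) aligns with t (cosine alignment only,
    a stand-in for a contrastive loss)."""
    W = [row[:] for row in W]
    for _ in range(steps):
        g_z = cosine_grad(encode(W, x), t)
        # z_k = sum_j W[k][j] * x[j], so dz_k / dW[k][j] = x[j].
        for k in range(len(W)):
            for j in range(len(x)):
                W[k][j] += lr * g_z[k] * x[j]
    return W

# Hypothetical toy setup: 4-dim "images", 3-dim shared embedding space.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5]]
t_desc = [0.0, 1.0, 0.0]   # embedding of the textual description
t_name = [0.0, 0.6, 0.8]   # embedding of the novel concept itself
x0 = [1.0, 0.2, 0.1, 0.0]  # starting point for the inversion

x_inv = invert(W, t_desc, x0)
W_new = finetune(W, x_inv, t_name)

print("description alignment before/after inversion:",
      cosine(encode(W, x0), t_desc), cosine(encode(W, x_inv), t_desc))
print("concept alignment before/after fine-tuning:",
      cosine(encode(W, x_inv), t_name), cosine(encode(W_new, x_inv), t_name))
```

In this sketch the inverted input is the only "image" the encoder ever sees for the new concept, which mirrors the claim that no real images or external generators are needed. A realistic version would use a deep encoder, image priors during inversion, and a contrastive loss with negatives.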
