Language-Informed Visual Concept Learning
Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
TL;DR
This work tackles learning fine-grained visual concepts along language-defined axes by distilling from large vision-language models. It trains axis-specific concept encoders that produce continuous embeddings, anchored to BLIP-2 text predictions and aligned through a reconstruction loss with a frozen Text-to-Image backbone, enabling faithful image reconstruction and compositional remixing. A lightweight test-time finetuning procedure extends generalization to unseen concepts, while experiments across five domains demonstrate improved disentanglement, color fidelity, and compositionality over text-prompt baselines. The approach offers data-efficient, controllable image generation and flexible concept remixing with practical applicability to real-world editing tasks.
Abstract
Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g., color, the exact visual nuances along each axis often exceed what language can articulate, e.g., a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with the objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of the different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen during training.
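
The objective described above combines a reconstruction term with a per-axis anchoring term, and remixing amounts to swapping the embedding of one axis between two images. The following is a minimal sketch of that structure only, not the authors' implementation. It assumes embeddings are plain lists of floats, that both terms are mean squared errors, and that the anchoring term is scaled by a weight; the names `training_loss`, `remix`, and `anchor_weight` are hypothetical, and the frozen T2I model and VQA model are left out entirely.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    reconstruction: Vector,
    target: Vector,
    concept_embeddings: Dict[str, Vector],
    anchor_embeddings: Dict[str, Vector],
    anchor_weight: float = 0.1,  # hypothetical value, not taken from the paper
) -> float:
    """Reconstruction term plus a weighted anchoring term summed over axes.

    `reconstruction` stands in for the output of the frozen T2I model,
    and `anchor_embeddings` for the text embeddings of the VQA answers.
    Using MSE for both terms is an assumption of this sketch.
    """
    recon_term = mse(reconstruction, target)
    anchor_term = sum(
        mse(concept_embeddings[axis], anchor_embeddings[axis])
        for axis in concept_embeddings
    )
    return recon_term + anchor_weight * anchor_term


def remix(
    source: Dict[str, Vector], donor: Dict[str, Vector], axis: str
) -> Dict[str, Vector]:
    """Return a copy of `source` with the embedding for `axis` taken from `donor`."""
    mixed = dict(source)
    mixed[axis] = donor[axis]
    return mixed
```

In this sketch, `remix(image_a, image_b, "color")` would yield the per-axis embeddings of `image_a` with only its color embedding replaced by that of `image_b`; the result would then condition the T2I model to generate the novel composition.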
