Language-Informed Visual Concept Learning

Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

TL;DR

This work tackles learning fine-grained visual concepts along language-defined axes by distilling from large vision-language models. It trains axis-specific concept encoders that produce continuous embeddings, anchored to BLIP-2 text predictions and aligned through a reconstruction loss with a frozen Text-to-Image backbone, enabling faithful image reconstruction and compositional remixing. A lightweight test-time finetuning procedure extends generalization to unseen concepts, while experiments across five domains demonstrate improved disentanglement, color fidelity, and compositionality over text-prompt baselines. The approach offers data-efficient, controllable image generation and flexible concept remixing with practical applicability to real-world editing tasks.

Abstract

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
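The two training signals described in the abstract, reconstruction through the frozen T2I model and anchoring to VQA-derived text embeddings, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the paper: the function names, the use of mean squared error for both terms, the averaging over axes, and the `anchor_weight` value. In a diffusion-based backbone the reconstruction term would more likely be a denoising loss; plain vectors stand in for it here.

```python
from typing import Dict, List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    reconstructed: Vector,
    target: Vector,
    concept_embeddings: Dict[str, Vector],
    anchor_embeddings: Dict[str, Vector],
    anchor_weight: float = 0.1,  # hypothetical weight, not from the source
) -> float:
    """Reconstruction term plus a weighted anchoring term.

    concept_embeddings: per-axis outputs of the concept encoders.
    anchor_embeddings: per-axis text embeddings (e.g. from VQA answers).
    """
    # Term 1: how well the frozen generator reproduces the input.
    recon = mse(reconstructed, target)
    # Term 2: keep each axis embedding close to its text anchor,
    # averaged over axes (the averaging is an assumption).
    anchor = sum(
        mse(concept_embeddings[axis], anchor_embeddings[axis])
        for axis in concept_embeddings
    ) / len(concept_embeddings)
    return recon + anchor_weight * anchor
```

The anchoring term is what ties each encoder to its own axis: without it, nothing would stop a single encoder from absorbing all the information needed for reconstruction.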

Paper Structure

This paper contains 36 sections, 3 equations, 29 figures, 2 tables.

Figures (29)

  • Figure 1: Language-Informed Visual Concept Learning. Our goal is to learn a visual concept representation grounded on a set of language-informed concept axes, e.g., category, color, and material, by simply distilling from pre-trained text-to-image generation models without manual annotations. After training, the concept encoders extract disentangled axis-specific embeddings from an image, which can be remixed to generate new images with novel concept compositions.
  • Figure 2: Learned Disentangled Concept Embeddings Improve Compositionality. Left: A vanilla text-to-image model may fail to adhere to text prompts describing uncommon combinations of concepts, e.g., "red banana", even with prompt engineering. Right: With the same backbone T2I generator, our learned disentangled concept embeddings greatly enhance concept compositionality.
  • Figure 3: Training Pipeline. During training, an input image is processed by a set of concept encoders that predict concept embeddings specific to given concept axes. These embeddings are trained to (1) retain information in order to reproduce visual inputs via a pre-trained Text-to-Image model given an axis-informed text template, and (2) ensure disentanglement across different axes by anchoring to text embeddings obtained from a pre-trained Visual Question Answering model.
  • Figure 4: Concept Recomposition. At test time, our model extracts visual concepts along various axes from different images and recomposes them to generate new images. We show recomposition results across different pairs of concept axes in 3 datasets: (a) Fruits, (b) Paintings, and (c) Furniture.
  • Figure 5: Generalization to Unseen Concepts via Finetuning. After test-time finetuning on a single test image, the encoders can adapt to a novel concept. Visual details from the input images are preserved, as can be seen from the images visualizing the embedding predictions. Importantly, these embeddings do not overfit to the input images and remain well disentangled, so they can be freely recomposed to create new concepts. More results can be found in \ref{['fig:app_exp_qualitative_seasons', 'fig:app_exp_qualitative_paintings', 'fig:app_exp_qualitative_fruits', 'fig:app_exp_qualitative_chairs', 'fig:app_exp_qualitative_objects']}. More real-world results can be found in \ref{['fig:real_world_objects', 'fig:real_world_objects_2', 'fig:real_world_fruits', 'fig:real_world_furniture_appendix', 'fig:real_world_art_appendix']}.
  • ...and 24 more figures
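
The recomposition step described in the Figure 4 caption amounts to swapping per-axis embeddings between images before they are passed to the generator. The sketch below illustrates only that bookkeeping, assuming each image has already been encoded into a dictionary keyed by axis name. The `recompose` function, the dictionary layout, and the toy embedding values are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Embedding = List[float]


def recompose(
    base: Dict[str, Embedding],
    donor: Dict[str, Embedding],
    axes_from_donor: List[str],
) -> Dict[str, Embedding]:
    """Return a copy of `base` with the listed axes taken from `donor`."""
    missing = [a for a in axes_from_donor if a not in base or a not in donor]
    if missing:
        raise KeyError(f"axes not present in both inputs: {missing}")
    # Copy so that neither input dictionary is modified.
    mixed = {axis: list(vec) for axis, vec in base.items()}
    for axis in axes_from_donor:
        mixed[axis] = list(donor[axis])
    return mixed


# Toy embeddings standing in for encoder outputs on two images.
banana = {"category": [1.0, 0.0], "color": [0.9, 0.8]}
cherry = {"category": [0.0, 1.0], "color": [0.8, 0.1]}

# Keep the banana's category, take the cherry's color.
red_banana = recompose(banana, cherry, ["color"])
```

In a full pipeline the mixed embeddings would then condition the frozen T2I model; that step is omitted here because it depends on the specific backbone.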