Visual Lexicon: Rich Image Features in Language Space

XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid

TL;DR

ViLex is a visual lexicon: an image encoder that maps images into the text (vocabulary-token) space used to prompt text-to-image (T2I) diffusion models, retaining both semantic content and fine visual detail in a lightweight, prompt-like representation. It is trained as a self-supervised autoencoder with a frozen T2I diffusion model as the decoder, and it supports both image generation and understanding, including zero-shot DreamBooth-style re-contextualization without fine-tuning the diffusion model. ViLex tokens can be used alone or combined with natural language prompts. Empirically, ViLex improves image reconstruction (lower FID, higher IS) and, used as a vision encoder, improves vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline. The result is a versatile visual encoder that can plug into existing VLMs with minimal token overhead.

Abstract

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings, even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.
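
The compositionality described in the abstract can be illustrated with a small sketch. It assumes that ViLex tokens and ordinary text-token embeddings are vectors of the same width, and that a combined prompt is formed by concatenating the two sequences before they reach the frozen text encoder. The function name `compose_prompt`, the text-then-ViLex ordering, and the default `max_length` of 77 (the context length of CLIP's text encoder) are assumptions made for illustration, not details confirmed by the paper.

```python
def compose_prompt(text_embeddings, vilex_tokens, max_length=77):
    """Concatenate text-token embeddings and ViLex tokens into one prompt.

    Both arguments are lists of equal-width vectors (lists of floats).
    Hypothetical helper: the ordering and the length limit are assumptions.
    """
    if not vilex_tokens:
        raise ValueError("need at least one ViLex token")
    dim = len(vilex_tokens[0])
    sequence = list(text_embeddings) + list(vilex_tokens)
    # Every vector must live in the same embedding space (same width).
    if any(len(vec) != dim for vec in sequence):
        raise ValueError("all embeddings must have the same dimension")
    # The frozen text encoder accepts a bounded number of tokens.
    if len(sequence) > max_length:
        raise ValueError("combined prompt exceeds the encoder context length")
    return sequence


# Toy usage: 5 text-token embeddings followed by 4 ViLex tokens, width 8.
text = [[0.1] * 8 for _ in range(5)]
vilex = [[0.9] * 8 for _ in range(4)]
prompt = compose_prompt(text, vilex)
print(len(prompt), "prompt vectors of width", len(prompt[0]))
```

Passing an empty `text_embeddings` list corresponds to using ViLex tokens on their own as the prompt.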

Paper Structure

This paper contains 13 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (top) ViLex empowers linguistic space to capture visual richness. We propose ViLex, an image encoder that maps images into the vocabulary space, effectively preserving semantic information and intricate visual details. The embeddings from ViLex function as a Visual Lexicon that preserves the semantic and intricate visual details of the image. ViLex is trained with a frozen text-to-image diffusion model and can be utilized independently as "text" tokens for image generation. (bottom) Linguistic space empowers ViLex to enjoy compositionality. ViLex can be combined with natural language tokens for prompting a pretrained T2I diffusion model with both visual and textual cues.
  • Figure 2: The pipeline of ViLex: We learn a Visual Lexicon from a frozen diffusion model using an image reconstruction loss. After training, ViLex can be directly used as the "text-prompt" to a frozen text encoder, e.g., CLIP or T5, enabling the re-creation of semantically similar images without the need for actual text prompts. In addition, during training, we implement the TailDrop strategy, where the last $k$ tokens are randomly dropped, encouraging earlier tokens in ViLex to carry richer semantic information. ViLex tokens can be utilized independently as "text" tokens for image generation or combined with natural language tokens for prompting T2I diffusion models with both visual and textual cues for multimodal image generation.
  • Figure 3: ViLex retains more visual details in image-to-image generation compared to DALL·E 3 [betker2023improving] and DeDiffusion [wei2024diffusion], accurately capturing elements such as image style (e.g., the oil painting style in row 1), layout (e.g., the relative position of the corgi and the lighthouse), pose (e.g., the corgi's stance), and object shapes (e.g., the shape of Van Gogh's hat). This enables ViLex to produce images that are both semantically and visually consistent with the original input. Even models with text embeddings in a shared language-vision space, like DALL·E 3, capable of generating semantic variations of an image, struggle to faithfully reconstruct the original appearance of the input image. For image-guided DALL·E results, we provide the input images along with the text prompt, "generate an image exactly the same as the input image". For DeDiffusion, we follow its official image-to-image generation pipeline and use SDXL [podell2023sdxl] as the T2I model.
  • Figure 4: ViLex can be seamlessly integrated with natural language prompts for zero-shot unsupervised image re-contextualization using a frozen text-to-image (T2I) diffusion model. Unlike DreamBooth [ruiz2023dreambooth], ViLex requires neither fine-tuning the T2I model on a set of input images of the same object nor modifying the model architecture (e.g., adding a LoRA [hu2021lora] adapter). Instead, ViLex is a universal model that enables zero-shot, unsupervised re-contextualization by simply prompting the T2I model with ViLex tokens and the corresponding text prompt tokens, just as we use real words. a) The inference pipeline demonstrating image re-contextualization. b) Qualitative comparisons with DreamBooth, with DreamBooth results taken from their project page.
  • Figure 5: ViLex can also support zero-shot unsupervised art rendition via prompting T2I models with ViLex and text prompts.
  • ...and 4 more figures
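
The TailDrop strategy mentioned in the Figure 2 caption (randomly dropping the last $k$ tokens during training) can be sketched as follows. The caption does not say how $k$ is sampled, so the uniform draw from 0 to `max_drop`, the function name `tail_drop`, and the list-of-vectors token representation are all assumptions made for illustration rather than the authors' implementation.

```python
import random


def tail_drop(tokens, max_drop, rng=random):
    """Return the token sequence with its last k tokens removed.

    k is drawn uniformly from 0..max_drop (an assumed distribution).
    At least one token is always kept.
    """
    if not 0 <= max_drop < len(tokens):
        raise ValueError("max_drop must be in [0, len(tokens) - 1]")
    k = rng.randint(0, max_drop)  # randint is inclusive on both ends
    # Only the tail is removed, so the order of the kept prefix is unchanged.
    return tokens[: len(tokens) - k]


# Toy usage: 16 tokens of width 4; each call keeps a random-length prefix.
rng = random.Random(0)
tokens = [[float(i)] * 4 for i in range(16)]
for step in range(3):
    kept = tail_drop(tokens, max_drop=15, rng=rng)
    print("step", step, "kept", len(kept), "tokens")
```

Because every training step keeps a prefix of the sequence, the earliest tokens are present in all steps while later tokens are present only some of the time, which is consistent with the caption's statement that earlier tokens are encouraged to carry richer semantic information.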