VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong

TL;DR

VISTA tackles the limitation of text-centric dense retrieval by introducing a universal multi-modal embedding that fuses visual tokens into a fixed text encoder via a ViT-based image tokenizer. It leverages two annotation-free data pipelines, IT2I and T2IT, and a two-stage training regime with cross-modal alignment followed by composed image–text training, achieving state-of-the-art zero-shot and supervised results across multiple benchmarks. The approach demonstrates strong generalization and practical applicability for retrieval-augmented systems, without task-specific tuning. Overall, VISTA provides a scalable, open-source pathway to robust multi-modal retrieval across diverse data modalities.

Abstract

Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, current methods are severely limited in representing text-only and image-only data. In this work, we present VISTA, a new embedding model for universal multi-modal retrieval. Our work makes three technical contributions. First, we introduce a flexible architecture which extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Second, we develop two data generation strategies, which produce high-quality composed image-text data to facilitate the training of the embedding model. Third, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
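
To make the architectural idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that image patch features are projected into the text encoder's embedding space to form visual tokens, that these are placed before the text tokens in one sequence, and that a single encoder turns the sequence into one embedding. Every name here (`image_to_visual_tokens`, `toy_text_encoder`, `embed`, `HIDDEN`) is hypothetical, the token order is an assumption, and the mean-pooling "encoder" is only a stand-in for a real pre-trained text encoder.

```python
import numpy as np

HIDDEN = 16  # hypothetical hidden size of the text encoder


def image_to_visual_tokens(patch_features, projection):
    """Map patch features (num_patches, patch_dim) into the text embedding space."""
    return patch_features @ projection


def toy_text_encoder(token_embeddings):
    """Stand-in for a pre-trained text encoder: mean-pool, then L2-normalize."""
    pooled = token_embeddings.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def embed(text_tokens=None, visual_tokens=None):
    """Embed text-only, image-only, or composed image-text input with one encoder."""
    parts = [p for p in (visual_tokens, text_tokens) if p is not None]
    if not parts:
        raise ValueError("need text tokens, visual tokens, or both")
    # Assumption: visual tokens come first, followed by the text tokens.
    sequence = np.concatenate(parts, axis=0)
    return toy_text_encoder(sequence)


rng = np.random.default_rng(0)
patch_features = rng.normal(size=(4, 8))   # 4 image patches, 8-dim features
projection = rng.normal(size=(8, HIDDEN))  # hypothetical learned projection
text_tokens = rng.normal(size=(5, HIDDEN)) # 5 text token embeddings

visual_tokens = image_to_visual_tokens(patch_features, projection)
text_only = embed(text_tokens=text_tokens)
image_only = embed(visual_tokens=visual_tokens)
composed = embed(text_tokens=text_tokens, visual_tokens=visual_tokens)

# All three inputs land in the same space, so a dot product of the
# unit-length vectors gives a cosine similarity usable for retrieval.
score = float(composed @ text_only)
```

Because all three input types share one embedding space in this sketch, a composed image-text query can be scored directly against text-only or image-only candidates.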

Paper Structure (26 sections, 8 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: The model architecture of VISTA. We use a pre-trained language model as the foundation, with the ViT encoder converting the image into tokens that the text encoder can recognize.
  • Figure 2: The construction pipeline of the Image&Text To Image (IT2I) dataset.
  • Figure 3: The specific prompts utilized during the generation of the Image&Text To Image (IT2I) dataset.
  • Figure 4: The specific prompts employed in the generation of the Text to Image&Text (T2IT) dataset, with the lengths of the articles and queries randomly assigned in each data generation iteration to ensure diversity. Typically, articles are approximately 50 words, and queries are within 20 words.
  • Figure 5: Qualitative examples of our VISTA model on the CIRR benchmark.
  • ...and 2 more figures