Nomic Embed Vision: Expanding the Latent Space

Zach Nussbaum, Brandon Duderstadt, Andriy Mulyar

TL;DR

This work addresses the need for a unified latent space that supports vision, language, and multimodal tasks by training nomic-embed-vision to align with nomic-embed-text. Using a Locked Text Tuning approach, the authors freeze a pretrained text encoder and train a vision encoder (EVA02-ViT B/16) with a CLIP-style objective on a large DFN-2B-derived dataset, achieving strong cross-modal and text benchmarks. Key contributions include demonstrating that a unified latent space can be built with open weights and that MAP pooling and large-batch training yield robust performance across Imagenet zero-shot and Flickr retrieval. The work has practical impact by enabling open-source, high-performance multimodal embeddings that combine vision and language representations in a single space, facilitating downstream retrieval and understanding tasks.
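For readers who want a concrete picture of the training setup, the following is a minimal sketch of a Locked Text Tuning step with a CLIP-style contrastive objective, written in PyTorch. The `vision_encoder` and `text_encoder` modules, the temperature value, and the training loop are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> matching image
    return (loss_i2t + loss_t2i) / 2

def locked_text_tuning_step(vision_encoder, text_encoder, images, captions, optimizer):
    # Locked Text Tuning: the pretrained text encoder stays frozen;
    # only the vision encoder receives gradient updates.
    text_encoder.eval()
    for p in text_encoder.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        text_emb = text_encoder(captions)                  # fixed target space
    image_emb = vision_encoder(images)                     # trainable, aligned to the text space
    loss = clip_style_loss(image_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text encoder is frozen, its output space is the fixed target: the vision encoder learns to place images near the embeddings of their captions, which is what lets the resulting image embeddings share a latent space with nomic-embed-text.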

Abstract

This technical report describes the training of nomic-embed-vision, a highly performant, open-code, open-weights image embedding model that shares the same latent space as nomic-embed-text. Together, nomic-embed-vision and nomic-embed-text form the first unified latent space to achieve high performance across vision, language, and multimodal tasks.
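As a hedged illustration of what a shared latent space makes possible, the sketch below performs text-to-image retrieval with nothing more than cosine similarity over pre-computed embeddings; the embedding tensors here are hypothetical placeholders rather than outputs of the released models.

```python
import torch
import torch.nn.functional as F

def retrieve_images(query_text_emb, image_embs, top_k=5):
    """Rank a gallery of image embeddings against a single text query embedding.

    Because text and image embeddings live in the same space, cross-modal
    retrieval reduces to cosine similarity between normalized vectors.
    """
    query = F.normalize(query_text_emb, dim=-1)    # (D,)
    gallery = F.normalize(image_embs, dim=-1)      # (N, D)
    scores = gallery @ query                       # (N,) cosine similarities
    return torch.topk(scores, k=top_k).indices     # indices of the best-matching images
```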

Paper Structure

This paper contains 13 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Multimodal and Text Embedding Benchmark. Aggregate performance of Nomic Embed v1.5, OpenAI CLIP ViT B/16, and Jina CLIP v1 on text and multimodal benchmarks. Nomic Embed v1.5 is the only multimodal encoder to outperform OpenAI CLIP on both multimodal and text benchmarks. X-axis units vary per benchmark suite: Imagenet is Imagenet Zero-Shot accuracy, Datacomp is a suite of 38 zero-shot multimodal evaluations, and MTEB evaluates the performance of text embedding models.
  • Figure 2: Imagenet Zero-Shot Top-1 Accuracy improves as batch size increases in small-scale experiments.
  • Figure 3: Effect of Pooling Layer on Performance in various retrieval and classification setups.