Nomic Embed Vision: Expanding the Latent Space
Zach Nussbaum, Brandon Duderstadt, Andriy Mulyar
TL;DR
This work addresses the need for a unified latent space that supports vision, language, and multimodal tasks by training nomic-embed-vision to align with nomic-embed-text. Using a Locked Text Tuning approach, the authors freeze a pretrained text encoder and train a vision encoder (EVA02-ViT B/16) with a CLIP-style objective on a large DFN-2B-derived dataset, achieving strong performance on cross-modal and text benchmarks. Key contributions include demonstrating that a unified latent space can be built with open weights, and that MAP pooling and large-batch training yield robust performance on ImageNet zero-shot classification and Flickr retrieval. The practical impact is a set of open-source, high-performance multimodal embeddings that place vision and language representations in a single space, facilitating downstream retrieval and understanding tasks.
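As a rough illustration of the Locked Text Tuning setup described above, the following PyTorch-style sketch freezes a pretrained text encoder and trains only the vision encoder with a symmetric CLIP-style contrastive loss. The encoder variables, temperature, and optimizer settings are illustrative assumptions, not the exact nomic-embed-vision training configuration.

```python
# Minimal sketch of Locked Text Tuning with a CLIP-style contrastive objective.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> matching image
    return (loss_i2t + loss_t2i) / 2

def freeze(module):
    """Lock a module: no gradient updates, eval-mode statistics."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Hypothetical training loop: the frozen text encoder defines the target space,
# and only the vision encoder is optimized to align with it.
#
# freeze(text_encoder)                     # e.g. a pretrained nomic-embed-text
# optimizer = torch.optim.AdamW(vision_encoder.parameters(), lr=1e-4)
# for images, texts in dataloader:
#     with torch.no_grad():
#         text_embeds = text_encoder(texts)     # fixed text embeddings
#     image_embeds = vision_encoder(images)     # trained to match the text space
#     loss = clip_style_loss(image_embeds, text_embeds)
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```

Because the text side never moves, images are pulled into the existing text latent space rather than both modalities drifting toward a new joint space, which is what lets the resulting vision encoder share embeddings with the unchanged text model.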
Abstract
This technical report describes the training of nomic-embed-vision, a highly performant, open-code, open-weights image embedding model that shares the same latent space as nomic-embed-text. Together, nomic-embed-vision and nomic-embed-text form the first unified latent space to achieve high performance across vision, language, and multimodal tasks.
