MOFI: Learning Image Representations from Noisy Entity Annotated Images

Wentao Wu, Aleksei Timofeev, Chen Chen, Bowen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang

TL;DR

MOFI tackles scalable visual representation learning from noisy web data by automatically labeling images with entities: a named entity recognition (NER) model extracts candidate entities from the alt-text, the candidates are disambiguated through Wikidata, and a CLIP model selects those that match the paired image. It introduces Image-to-Entities (I2E), a dataset with about 1 billion images and 2 million distinct entities, and evaluates three training recipes: supervised, contrastive, and multi-task pre-training. The multi-task MOFI model achieves state-of-the-art retrieval on GPR1200 (86.66% mAP, up from the previous best of 72.19%) and strong zero-shot and linear-probe results on ImageNet and VTAB. The results indicate that entity-centric labels combined with multi-task objectives yield richer representations than image-text pairs alone, improving both retrieval and classification. The work contributes a scalable data construction method and a unified framework combining classification and contrastive signals, and it releases code and model weights as open source.
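As a reminder of what the retrieval metric measures, the following is a minimal sketch of mean average precision (mAP) over ranked retrieval results. It is a generic formulation, not the GPR1200 evaluation code; the function names and the toy relevance lists are hypothetical, and it assumes each ranking contains every relevant item for its query.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, one per retrieved item in rank order,
    True where the retrieved item is relevant to the query.
    Assumes the ranking contains all relevant items (illustrative assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precisions."""
    scores = [average_precision(r) for r in per_query_relevance]
    return sum(scores) / len(scores)


# Toy example with two hypothetical queries.
toy_queries = [
    [True, False, True],  # relevant items at ranks 1 and 3
    [False, True],        # relevant item at rank 2
]
toy_map = mean_average_precision(toy_queries)
```

For the first toy query the average precision is (1/1 + 2/3) / 2, and for the second it is 1/2, so rankings that place relevant images earlier score higher.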

Abstract

We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. It's a simple, cost-effective method that can scale to handle billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes like supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.
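The entity-selection step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the released pipeline: the abstract says only that a CLIP model selects the correct entities, so the cosine-similarity threshold rule, the `threshold` value, the entity identifiers, and the toy embeddings below are all hypothetical. In practice the embeddings would come from CLIP image and text encoders applied to the image and to each NER candidate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_entities(image_emb, candidate_embs, threshold=0.5):
    """Keep candidate entities whose text embedding is close to the image.

    image_emb: embedding of the image (assumed to come from a CLIP image encoder).
    candidate_embs: dict mapping entity id -> text embedding of that entity
        (assumed to come from a CLIP text encoder over NER candidates).
    threshold: hypothetical similarity cutoff, not a value from the paper.
    Returns (entity_id, score) pairs sorted by descending similarity.
    """
    scored = [(eid, cosine(image_emb, emb)) for eid, emb in candidate_embs.items()]
    kept = [(eid, score) for eid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings standing in for real CLIP outputs.
image_emb = [1.0, 0.0]
candidates = {
    "entity_bridge": [0.9, 0.1],  # visually consistent with the image
    "entity_band": [0.0, 1.0],    # mentioned in the alt-text but not depicted
}
selected = select_entities(image_emb, candidates)
```

Here only `entity_bridge` survives the cutoff, which mirrors the idea that entities mentioned in the alt-text but not visible in the image should not become labels.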

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: MOFI is trained on the new Image-to-Entities (I2E) dataset, which has 66x more classes than previous datasets, and achieves significantly better performance on image retrieval tasks.
  • Figure 2: Examples of the I2E dataset. Each caption is formatted as Entity_id (Entity_name).
  • Figure 3: Illustration of the different approaches explored in this paper to learn image representations from the I2E dataset. Supervised pre-training treats entities as labels, contrastive pre-training uses entity names and descriptions as free-form text, and multi-task pre-training combines the two.
  • Figure 4: Examples of top-1 retrieved images on the GPR1200, $\mathcal{R}$Oxford, and $\mathcal{R}$Paris evaluation sets. Green (✓) and red (✗) indicate positive and negative images, respectively.
  • Figure 5: t-SNE visualization of the image representations learned by MOFI on the GPR1200 evaluation set. The left figure shows the distribution of the six domains in the feature space. The right figure shows the distribution within the Stanford Online Products domain.
  • ...and 1 more figures
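The multi-task recipe summarized in the Figure 3 caption, which combines entity classification with contrastive learning, can be sketched as a sum of two losses. This is a minimal illustration, not the paper's implementation: the softmax cross-entropy and symmetric InfoNCE forms, the `temperature` default, the `weight` parameter, and all function names are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def classification_loss(class_logits, labels):
    """Mean cross-entropy; class_logits[i] scores every entity class for image i."""
    total = sum(log_softmax(row)[y] for row, y in zip(class_logits, labels))
    return -total / len(labels)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i should match text i within the batch.

    Embeddings are assumed to be L2-normalized; temperature is a hypothetical default.
    """
    n = len(image_embs)
    sim = [[sum(a * b for a, b in zip(u, v)) / temperature for v in text_embs]
           for u in image_embs]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_image = -sum(
        log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def multitask_loss(class_logits, labels, image_embs, text_embs, weight=1.0):
    """Classification loss plus a weighted contrastive loss (weight is hypothetical)."""
    return (classification_loss(class_logits, labels)
            + weight * contrastive_loss(image_embs, text_embs))


# Toy batch of two images with two entity classes.
toy_logits = [[0.0, 0.0], [0.0, 0.0]]     # uninformative classifier
toy_labels = [0, 1]
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]      # correctly paired entity text
toy_total = multitask_loss(toy_logits, toy_labels, toy_images, toy_texts)
```

In this sketch both objectives share the image embeddings conceptually, so the classification term pushes images of the same entity together while the contrastive term aligns them with the entity's name and description text.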