MIMIC: Masked Image Modeling with Image Correspondences

Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna

TL;DR

Dense vision tasks require pixel-accurate representations, but large-scale pretraining is hampered by the lack of large multi-view datasets: existing curation methods depend on 3D metadata that real-world sources lack. The authors propose MIMIC, an annotation-free data-curation pipeline that mines multi-view image pairs from unannotated real videos and 3D environments for masked image modeling with MAE and CroCo. The dataset is built at two scales (MIMIC-1M and MIMIC-3M). MIMIC-3M-pretrained models outperform both ImageNet-1K and Multiview-Habitat baselines on depth and surface normal estimation, and outperform Multiview-Habitat on semantic segmentation and pose estimation while narrowing the gap to ImageNet-1K, with strong few-shot and reconstruction-quality results. The results suggest that high-quality dense representations can be pretrained at scale without manual annotation.

Abstract

Dense, pixel-specific representation learning at scale has been bottlenecked by the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets rely heavily on annotated 3D meshes, point clouds, and camera parameters from simulated environments, which prevents them from building datasets from real-world data sources where such metadata is lacking. We propose a pretraining dataset-curation approach that does not require any additional annotations. Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with different masked image modeling objectives to showcase the following findings. Representations trained on our automatically generated MIMIC-3M outperform those learned from expensive crowdsourced datasets (ImageNet-1K) and those learned from synthetic environments (MULTIVIEW-HABITAT) on two dense geometric tasks: depth estimation on NYUv2 (1.7%) and surface normal estimation on Taskonomy (2.05%). For dense tasks that also require object understanding, we outperform MULTIVIEW-HABITAT on semantic segmentation on ADE20K (3.89%) and pose estimation on MSCOCO (9.4%), and reduce the gap with models pretrained on the expensive, object-centric ImageNet-1K. We outperform these baselines even when the representations are frozen and when downstream training data is limited to few-shot settings. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.

Paper Structure (27 sections, 12 figures, 7 tables)

This paper contains 27 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: We introduce a data-curation method that generates multi-view image datasets for self-supervised learning. Our method identifies potential data sources, including videos of indoor scenes, people, and objects, 3D indoor environments, outdoor street views, and stereo pairs to mine potential multiview images. Next, we use classical computer vision methods such as SIFT keypoint detection and homography transformation to locate corresponding patches. Finally, we filter pairs based on a threshold for significant overlap, ensuring a substantial percentage of pixels match between a pair.
  • Figure 2: Distribution of Data Sources (%). Real data sources, including DeMoN, ScanNet, ARKitScenes, Objectron, CO3D, Mannequin, and 3DStreetView, contribute 32% of MIMIC. The remaining portion consists of synthetic sources, namely HM3D, Gibson, and Matterport.
  • Figure 3: (a) The performance of CroCo pretrained on MIMIC improves with the number of pretraining epochs. The left plot shows the trend for the fine-tuned and frozen versions of the encoder on NYUv2 depth estimation. The right plot shows the trend on the ADE20K dataset. (b) CroCo pretrained on MIMIC-3M achieves better few-shot performance than CroCo pretrained on Multiview-Habitat. The left plot shows few-shot performance on the NYUv2 dataset and the right plot shows few-shot performance on ADE20K.
  • Figure 4: (a) A pair of images with SIFT keypoints. (b) Matching keypoints of the two images with a brute-force matcher.
  • Figure 5: (a) Inlier matches after finding the homography matrix. (b) Dividing each image into non-overlapping patches.
  • ...and 7 more figures
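
The captions for Figures 1 and 5 describe mapping patches from one view to the other with a homography and keeping only pairs with significant overlap. The sketch below illustrates that filtering step in plain Python under stated assumptions: the homography `H` is taken as given (in practice it would be estimated from SIFT matches, for example with OpenCV's `cv2.findHomography`), and the function names, the 16-pixel patch size, and the 0.5 overlap threshold are all hypothetical choices for illustration, not values from the paper.

```python
# Hypothetical sketch of patch correspondence and overlap filtering.
# Patch size and threshold are illustrative assumptions, not the paper's settings.

def apply_homography(H, x, y):
    """Map point (x, y) through a 3x3 homography given as nested lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None  # the point maps to infinity
    return xh / w, yh / w


def patch_correspondences(H, width, height, patch=16):
    """Pair each view-1 patch with the view-2 patch its centre lands in."""
    cols, rows = width // patch, height // patch
    matches = {}
    for r in range(rows):
        for c in range(cols):
            # Centre of patch (r, c) in view 1, in pixel coordinates.
            cx, cy = (c + 0.5) * patch, (r + 0.5) * patch
            mapped = apply_homography(H, cx, cy)
            if mapped is None:
                continue
            mx, my = mapped
            # Keep the patch only if its centre falls inside view 2's patch grid.
            if 0 <= mx < cols * patch and 0 <= my < rows * patch:
                matches[(r, c)] = (int(my // patch), int(mx // patch))
    return matches


def overlap_fraction(H, width, height, patch=16):
    """Fraction of view-1 patches that have a counterpart in view 2."""
    total = (width // patch) * (height // patch)
    if total == 0:
        return 0.0
    return len(patch_correspondences(H, width, height, patch)) / total


def keep_pair(H, width, height, patch=16, min_overlap=0.5):
    """Accept an image pair only if enough patches correspond."""
    return overlap_fraction(H, width, height, patch) >= min_overlap
```

For example, with a pure horizontal shift `H = [[1, 0, 112], [0, 1, 0], [0, 0, 1]]` on a 224x224 image, half of the patch centres land inside the second view, so the pair sits exactly at the assumed 0.5 threshold and is kept.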