DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale

Ke Du, Yimin Peng, Chao Gao, Fan Zhou, Siqiao Xue

TL;DR

DORAEMON tackles fragmentation in large-scale visual object modeling by providing a unified, YAML-driven PyTorch framework that supports image classification, face recognition, and retrieval through a shared backbone and modular heads. It exposes over 1,000 pretrained backbones via the timm ecosystem, enables elastic distributed training, and offers one-click exports to ONNX, TensorRT, or TorchScript, bridging research and deployment. The framework achieves reproducible baselines on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, demonstrating strong cross-task transferability and scalability. With HuggingFace integration, Grad-CAM visualizations, and a configurable pipeline, DORAEMON lowers the barrier to rapid experimentation and production deployment, while positioning itself as a platform for future multimodal, language-informed vision systems.

Abstract

DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval, and metric learning; more than 1,000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations, and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.

Paper Structure

This paper contains 17 sections, 4 equations, 3 figures.

Figures (3)

  • Figure 1: Cumulative number of arXiv publications on large-scale visual object modeling. Labelled points highlight representative papers.
  • Figure 2: Cumulative number of GitHub open-source projects for large-scale visual object modeling. Labelled points highlight representative repositories.
  • Figure 4: Unified training pipeline of DORAEMON. The framework uses a shared visual backbone with modular task heads for classification, face recognition, and retrieval. Data processing, loss functions, and optimization are all configurable via YAML for scalable deployment.
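
The pipeline caption describes one shared backbone whose task head is selected by configuration. A minimal sketch of that idea follows, using only the Python standard library. It is not DORAEMON's actual API: the registry, the class names, the `build_pipeline` function, and the config keys (`task`, `backbone`, `feature_dim`, `head.out_dim`) are all hypothetical, and plain dictionaries stand in for parsed YAML files.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Type

# Hypothetical registry: task name -> head class. Not DORAEMON's real API.
HEAD_REGISTRY: Dict[str, Type["Head"]] = {}


def register_head(name: str):
    """Class decorator that records a head class under a task name."""
    def wrap(cls):
        HEAD_REGISTRY[name] = cls
        return cls
    return wrap


@dataclass
class Head:
    in_dim: int   # feature size produced by the shared backbone
    out_dim: int  # number of classes or embedding size

    def output_shape(self, batch: int) -> Tuple[int, int]:
        # Shape of the head output for a batch of feature vectors.
        return (batch, self.out_dim)


@register_head("classification")
class ClassificationHead(Head):
    """Maps backbone features to class logits."""


@register_head("retrieval")
class EmbeddingHead(Head):
    """Maps backbone features to an embedding used for retrieval."""


@dataclass
class Pipeline:
    backbone: str      # backbone name, kept as a plain string in this sketch
    feature_dim: int
    head: Head


def build_pipeline(cfg: dict) -> Pipeline:
    """Select the task head named in the config and attach it to the backbone."""
    task = cfg["task"]
    if task not in HEAD_REGISTRY:
        raise KeyError(f"unknown task {task!r}; known: {sorted(HEAD_REGISTRY)}")
    head = HEAD_REGISTRY[task](in_dim=cfg["feature_dim"],
                               out_dim=cfg["head"]["out_dim"])
    return Pipeline(cfg["backbone"], cfg["feature_dim"], head)


# Two illustrative configs (stand-ins for YAML) that share a backbone
# and differ only in the task head.
CLS_CFG = {"task": "classification", "backbone": "resnet50",
           "feature_dim": 2048, "head": {"out_dim": 1000}}
RET_CFG = {"task": "retrieval", "backbone": "resnet50",
           "feature_dim": 2048, "head": {"out_dim": 512}}

cls_pipeline = build_pipeline(CLS_CFG)
ret_pipeline = build_pipeline(RET_CFG)
```

In this sketch, switching tasks means changing only the `task` and `head` entries of the config, while the backbone entry stays the same. A real implementation would construct actual PyTorch modules rather than the placeholder dataclasses used here.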