DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale
Ke Du, Yimin Peng, Chao Gao, Fan Zhou, Siqiao Xue
TL;DR
DORAEMON tackles fragmentation in large-scale visual object modeling by providing a unified, YAML-driven PyTorch framework that supports image classification, face recognition, and retrieval through a shared backbone and modular heads. It exposes over 1,000 pretrained backbones via the timm ecosystem, enables elastic distributed training, and offers one-click exports to ONNX, TensorRT, or TorchScript, bridging research and deployment. The framework achieves reproducible baselines on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, demonstrating strong cross-task transferability and scalability. With HuggingFace integration, Grad-CAM visualizations, and a configurable pipeline, DORAEMON lowers the barrier to rapid experimentation and production deployment, while positioning itself as a platform for future multimodal, language-informed vision systems.
Abstract
DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval, and metric learning; more than 1,000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations, and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.
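
The idea of one configuration selecting a task-specific head on top of a shared backbone can be illustrated with a small registry pattern. The sketch below is not DORAEMON's actual API: the names `HEAD_REGISTRY`, `register_head`, and `build_model_spec`, as well as every configuration key, are hypothetical, and a plain Python dict stands in for a parsed YAML file. It returns a description of the model rather than a real PyTorch module so that it runs with the standard library alone.

```python
from typing import Callable, Dict

# Hypothetical registry: task name -> factory that describes a head.
HEAD_REGISTRY: Dict[str, Callable[..., dict]] = {}


def register_head(task: str):
    """Decorator that records a head factory under a task name."""
    def decorator(factory: Callable[..., dict]) -> Callable[..., dict]:
        HEAD_REGISTRY[task] = factory
        return factory
    return decorator


@register_head("classification")
def classification_head(feature_dim: int, num_classes: int) -> dict:
    # A linear classifier over the backbone features.
    return {"type": "linear", "in": feature_dim, "out": num_classes}


@register_head("retrieval")
def retrieval_head(feature_dim: int, embedding_dim: int) -> dict:
    # A projection into an embedding space used for metric learning.
    return {"type": "embedding", "in": feature_dim, "out": embedding_dim}


def build_model_spec(config: dict) -> dict:
    """Combine the shared backbone with the head chosen by config['task']."""
    task = config["task"]
    if task not in HEAD_REGISTRY:
        raise KeyError(f"unknown task: {task!r}")
    backbone = config["backbone"]
    head = HEAD_REGISTRY[task](backbone["feature_dim"], **config["head"])
    return {"backbone": backbone["name"], "head": head}


# Stand-in for a parsed YAML file (all keys are illustrative assumptions).
config = {
    "task": "retrieval",
    "backbone": {"name": "resnet50", "feature_dim": 2048},
    "head": {"embedding_dim": 128},
}
spec = build_model_spec(config)
```

Switching `task` to `"classification"` and the head options to `{"num_classes": 1000}` reuses the same backbone entry unchanged, which is the property a shared-backbone design is meant to provide.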
