ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts
Sangbum Choi, Kyeongryeol Go, Taewoong Jang
TL;DR
ZERO tackles the gap between academic foundation models and industrial zero-shot deployment by introducing a data-efficient, multi-modal prompting approach. A data engine constructs a compact annotated dataset of 0.9M samples from a billion-scale industrial corpus, while a training strategy combines distillation and contrastive learning to align text and visual prompts, enabling decoupled inference at deploy time. Empirical results show strong zero-shot performance on LVIS-Val and superior generalization across 37 industrial datasets, with top-5 placements in two CVPR 2025 challenges (2nd in Object Instance Detection, InsDet, and 4th in Foundational Few-shot Object Detection, FSOD). These findings demonstrate that domain-specific, zero-shot object detection can be achieved with minimal labeled data and without retraining, offering practical benefits for enterprise AI.
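To make the contrastive alignment of text and visual prompts more concrete, here is a minimal sketch of one common way such an objective is written: a symmetric, InfoNCE-style loss over paired embeddings. This is an assumption for illustration, not the authors' training code. The function names, the temperature value of 0.07, and the pairing of row i of the text embeddings with row i of the visual embeddings are all hypothetical, and the distillation term mentioned above is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, visual_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched text/visual prompt embeddings together.

    Hypothetical setup: text_emb[i] and visual_emb[i] describe the same
    category; every other pairing in the batch is treated as a negative.
    """
    text = [l2_normalize(v) for v in text_emb]
    visual = [l2_normalize(v) for v in visual_emb]
    # Cosine similarity between every text/visual pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in visual]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-visual and visual-to-text directions.
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )
```

Under these assumptions, the loss is close to zero when each text embedding points in the same direction as its paired visual embedding, and it grows when the pairs are mismatched. A shared embedding space of this kind is one plausible reason either prompt type could be used on its own at inference time.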
Abstract
Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative set of 0.9 million annotated samples drawn from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation and limited data. To the best of our knowledge, ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications.
