ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts

Sangbum Choi, Kyeongryeol Go, Taewoong Jang

TL;DR

ZERO addresses the gap between academic foundation models and industrial zero-shot deployment with a data-efficient, multi-modal prompting approach. A data engine constructs a compact dataset of 0.9M annotated samples from a billion-scale industrial corpus, and a training strategy combines distillation with contrastive learning to align text and visual prompts, so that either prompt type can be used on its own at inference time (decoupled inference). The model is competitive in zero-shot evaluation on LVIS-Val and outperforms existing models across 37 industrial datasets. It also placed 2nd in the CVPR 2025 Object Instance Detection (InsDet) Challenge and 4th in the Foundational Few-shot Object Detection (FSOD) Challenge. These results indicate that domain-specific, zero-shot object detection can be achieved with limited labeled data and without retraining, which is of practical benefit for enterprise AI.

Abstract

Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative 0.9 million annotated samples from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation and limited data. To the best of our knowledge, ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications.

Paper Structure

This paper contains 23 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the proposed ZERO training pipeline. The pipeline consists of a data engine (left), which constructs a compact and richly annotated industrial dataset through collection, selection, and pseudo-labeling, and a training strategy (right), which progressively adapts a pretrained open-vocabulary detector using distillation and alignment, ultimately enabling decoupled inference with either a text prompt or a visual prompt.
  • Figure 2: Qualitative examples of ZERO in zero-shot detection settings using different types of prompts.
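
The summary states that contrastive learning aligns text and visual prompts but does not give the loss. As a rough illustration only, the sketch below uses a generic symmetric InfoNCE-style objective over paired prompt embeddings. This is an assumption, not the authors' formulation: the function names, the temperature value of 0.07, and the toy 2-D embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(text_embs, visual_embs, temperature):
    """Cosine similarity between every text/visual pair, divided by the temperature."""
    text = [l2_normalize(v) for v in text_embs]
    visual = [l2_normalize(v) for v in visual_embs]
    return [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in visual]
        for t in text
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_denominator - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, visual_embs, temperature=0.07):
    """Average of the text-to-visual and visual-to-text contrastive losses.

    Assumes text_embs[i] and visual_embs[i] describe the same category.
    The temperature default is an illustrative choice, not a value from the paper.
    """
    logits = similarity_logits(text_embs, visual_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy embeddings (hypothetical): two categories in a 2-D embedding space.
text_prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[0.9, 0.1], [0.1, 0.9]]
swapped_visual = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_contrastive_loss(text_prompts, aligned_visual)
swapped_loss = symmetric_contrastive_loss(text_prompts, swapped_visual)
```

With these toy inputs, the aligned pairing yields a lower loss than the swapped one. Once both prompt types share an embedding space in this way, either one can in principle be scored against region features on its own, which is the intuition behind the decoupled inference described in Figure 1.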