Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings
Keno Moenck, Duc Trung Thieu, Julian Koch, Thorsten Schüppstuhl
TL;DR
This work addresses the scarcity of industrial, multimodal data for Vision Foundation Models and the difficulty of applying CLIP-like models to industry without abundant labeled data. It introduces ILID, a web-crawled dataset of 12,537 image-text pairs from industrial catalogs, and demonstrates self-supervised transfer learning using adapters and prompt-learning to specialize CLIP for industrial tasks. The study shows that transfer-learning approaches outperform zero-shot CLIP on material prompting and language-guided segmentation, highlighting practical gains for manufacturing inspection and automation. By providing a complete data-generation pipeline and open-source code, it lays a path for broader adoption of vision-language models in industrial settings.
Abstract
In recent years, the rise of Large Language Models (LLMs) has also encouraged the computer vision community to build substantial multimodal datasets and train models at scale in a self-/semi-supervised manner, resulting in Vision Foundation Models (VFMs) such as Contrastive Language-Image Pre-training (CLIP). These models generalize well and perform outstandingly on everyday objects and scenes, even on downstream tasks, i.e., tasks the model has not been trained on, while their application in specialized domains, such as an industrial context, is still an open research question. Here, fine-tuning the models or transfer learning on domain-specific data is unavoidable when aiming for adequate performance. In this work, we, on the one hand, introduce a pipeline to generate the Industrial Language-Image Dataset (ILID) based on web-crawled data; on the other hand, we demonstrate effective self-supervised transfer learning and discuss downstream tasks after training on the cheaply acquired ILID, which does not necessitate human labeling or intervention. With the proposed approach, we contribute by transferring approaches from state-of-the-art research around foundation models, transfer learning strategies, and applications to the industrial domain.
