Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings

Keno Moenck, Duc Trung Thieu, Julian Koch, Thorsten Schüppstuhl

TL;DR

This work addresses the scarcity of industrial, multimodal data for Vision Foundation Models and the difficulty of applying CLIP-like models to industry without abundant labeled data. It introduces ILID, a web-crawled dataset of 12,537 image-text pairs from industrial catalogs, and demonstrates self-supervised transfer learning using adapters and prompt-learning to specialize CLIP for industrial tasks. The study shows that transfer-learning approaches outperform zero-shot CLIP on material prompting and language-guided segmentation, highlighting practical gains for manufacturing inspection and automation. By providing a complete data-generation pipeline and open-source code, it lays a path for broader adoption of vision-language models in industrial settings.

Abstract

In recent years, the rise of Large Language Models (LLMs) has also encouraged the computer vision community to build substantial multimodal datasets and train models at scale in a self-/semi-supervised manner, resulting in Vision Foundation Models (VFMs) such as Contrastive Language-Image Pre-training (CLIP). These models generalize well and perform outstandingly on everyday objects and scenes, even on downstream tasks they were not trained on, whereas their application in specialized domains, such as industrial contexts, remains an open research question. There, fine-tuning the models or transfer learning on domain-specific data is unavoidable when aiming for adequate performance. In this work, we first introduce a pipeline to generate the Industrial Language-Image Dataset (ILID) from web-crawled data. Second, we demonstrate effective self-supervised transfer learning and discuss downstream tasks after training on the cheaply acquired ILID, which requires no human labeling or intervention. We thereby contribute by transferring state-of-the-art research on foundation models, transfer learning strategies, and applications to the industrial domain.

Paper Structure (23 sections, 14 figures, 2 tables)

Figures (14)

  • Figure 1: CLIP classification results (a) after transfer learning on the Industrial Language-Image Dataset (ILID) and (b) for the zero-shot baseline.
  • Figure 2: Overview of this work's method: (1) generation of the Industrial Language-Image Dataset (ILID), (2) transfer learning using the ILID, and (3) evaluating the performance in different tasks.
  • Figure 3: Joint embedding space of text and image representations: conceptually similar texts and images are encoded close to each other, whereas dissimilar pairings do not share similar positions.
  • Figure 4: Dataset generation pipeline resulting in the Industrial Language-Image Dataset (ILID).
  • Figure 5: The architectures used in this work: (a) CLIP-Adapter (Gao et al., 2021) and (b) CoOp (Zhou et al., 2021).
  • ...and 9 more figures
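
The joint embedding space described in the Figure 3 caption can be illustrated with a minimal sketch of CLIP-style zero-shot classification: an image embedding is compared with one text embedding per candidate prompt by cosine similarity, and a softmax over the scaled similarities gives class probabilities. The vectors, prompts, and temperature value below are made-up assumptions for illustration only; they are not taken from the paper or from ILID, and real CLIP embeddings come from trained encoders with hundreds of dimensions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled similarities (temperature is an assumed value)."""
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "a photo of a hex bolt": [1.0, 0.0, 0.1],
    "a photo of a rubber seal": [0.0, 1.0, 0.3],
}

labels = list(text_embs)
probs = zero_shot_probs(image_emb, [text_embs[label] for label in labels])
best = labels[probs.index(max(probs))]
print(best, probs)
```

In this toy setup the image vector lies close to the first prompt's vector, so that prompt receives nearly all of the probability mass. Transfer learning as described in the abstract would change how the embeddings are produced, not this comparison step.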