Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model

Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato

TL;DR

This work addresses the challenge of performing visual inspection on new products without extensive retraining by integrating Vision-Language Models (VLMs) with In-Context Learning (ICL). A ViP-LLaVA-based framework is fine-tuned on a curated dataset of non-defective and defective products, and ICL prompts with explanatory criteria guide the inspection of unseen items, producing defect coordinates for localization. The approach achieves an MCC of $0.804$ and an F1-score of $0.950$ on MVTec AD in a one-shot setting and improves over baselines on VisA. The study also analyzes exemplar selection and the method's limitations. It provides a path toward generalizable, explainable visual inspection with minimal task-specific data, and releases code and data to foster broader adoption and further improvements.

Abstract

We propose a general visual inspection model that uses a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. We therefore construct a dataset of diverse images of non-defective and defective products collected from the web, paired with output text in a unified format, and fine-tune the VLM on it. For new products, our method employs In-Context Learning, which allows the model to perform inspection given a single example image, non-defective or defective, and the corresponding explanatory text with visual prompts. This approach eliminates the need to collect a large number of training samples and to re-train the model for each product. The experimental results show that our method achieves high performance, with an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
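As a reading aid for the two metrics reported above, the following is a minimal sketch of how F1-score and the Matthews correlation coefficient (MCC) are computed from binary confusion counts, treating "defective" as the positive class. The function name `f1_and_mcc` and the counts in the usage note are hypothetical illustrations, not the authors' evaluation code or their results.

```python
import math


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1, MCC) for a binary confusion matrix.

    tp/fp/fn/tn are counts with "defective" as the positive class.
    Undefined ratios (zero denominators) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # MCC uses all four cells, so true negatives also affect the score.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    f1, mcc = f1_and_mcc(tp=90, fp=10, fn=10, tn=40)
    print(f"F1={f1:.3f}, MCC={mcc:.3f}")
```

F1 is built from precision and recall and therefore ignores true negatives, whereas MCC depends on all four cells of the confusion matrix. With the hypothetical counts above, the two scores differ (F1 is 0.9 while MCC is 0.7), which illustrates why the two are informative when reported together.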

Paper Structure

This paper contains 20 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Framework of our proposed method. We use ICL with multiple image inputs to give the VLM the inspection criteria for new products. Our framework outputs the coordinates of the defective location, which helps the user understand the model's decision. In addition, the foundation model can easily be replaced when a better VLM is proposed.
  • Figure 2: Architecture of ViP-LLaVA. Given an image and the corresponding text, the image is converted into visual tokens by the CLIP ViT, LayerNorm, and MLP layers, while the text is tokenized by the tokenizer. The visual tokens and the text tokens are then passed to the LLM to generate the answer.
  • Figure 3: Examples of non-defective and defective images of "Pill" in MVTec AD and "Capsules" in VisA. For "Pill", the non-defective image also contains red spots, making it difficult to inspect. Similarly, for "Capsules", the non-defective image also contains brown stains.
  • Figure 4: Framework of evaluation. First, an example is selected based on Eq. (1); the test image is then inspected with ICL.
  • Figure 5: Visualization of the model predictions on MVTec AD.
  • ...and 6 more figures