Learning to Detect Baked Goods with Limited Supervision

Thomas H. Schmitt, Maximilian Bundscherer, Tobias Bocklet

TL;DR

This work tackles automatic counting and detection of German baked goods under scarce annotations. It proposes two limited-supervision pipelines: weakly supervised training, in which open-vocabulary detectors localize objects and image-level labels supply the classes, and pseudo-labeling of video frames with Segment Anything 2 to improve viewpoint robustness. Experiments on a real-world, 19-class bakery dataset show that image-level supervision with post-processing yields a strong mAP (≈0.91), while fine-tuning on pseudo-labels improves performance under non-ideal deployment conditions by ≈19% and, combined with the weakly supervised pipeline, can surpass the fully supervised baseline in that setting. The study offers practical, on-device workflows that address industry-specific data scarcity and deployment constraints, with broader implications for adapting CV in highly specialized production environments.

Abstract

Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two workflows for training an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows yields a model that surpasses our fully supervised baseline under non-ideal deployment conditions, despite relying only on image-level supervision.
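The first workflow can be illustrated with a minimal sketch of how class-agnostic box proposals and an image-level label might be turned into detection training labels. This is not the authors' implementation: the abstract does not specify the post-processing steps, so the score threshold, the oversized-box filter, the greedy IoU suppression, and all names (`Box`, `filter_boxes`, `to_yolo_lines`) are assumptions chosen for illustration. The only external convention relied on is the normalized `class cx cy w h` label format commonly used for YOLO training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels with a detector confidence (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


def area(b: Box) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, img_w, img_h, min_score=0.3,
                 max_area_frac=0.8, iou_thr=0.5):
    """Assumed post-processing: drop low-score and near-full-image boxes,
    then greedily suppress overlapping boxes (highest score first)."""
    img_area = img_w * img_h
    candidates = [
        b for b in boxes
        if b.score >= min_score and 0 < area(b) <= max_area_frac * img_area
    ]
    kept = []
    for b in sorted(candidates, key=lambda c: c.score, reverse=True):
        if all(iou(b, k) < iou_thr for k in kept):
            kept.append(b)
    return kept


def to_yolo_lines(boxes, class_id, img_w, img_h):
    """Give every kept box the image-level class and emit normalized
    'class cx cy w h' label lines."""
    lines = []
    for b in boxes:
        cx = (b.x1 + b.x2) / 2 / img_w
        cy = (b.y1 + b.y2) / 2 / img_h
        w = (b.x2 - b.x1) / img_w
        h = (b.y2 - b.y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines
```

In this sketch, the detector is only trusted for localization: because each single-class image contains one kind of baked good, the image-level label is assigned to every surviving box, so no box-level annotation is needed.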

Paper Structure

This paper contains 17 sections, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Relative class distributions of baked goods across Deployment ($D$), Single-Class ($C_{\text{train}}$) and Test Video ($V_{\text{test}}$) Images.
  • Figure 2: Sample deployment image captured by bakery staff using the iOS application.
  • Figure 3: Effect of our post-processing on a Grounding DINO prediction for a single-class image containing the baked good Apfeltasche. Filtered bounding boxes are shown in orange.
  • Figure 4: Model mAP on $A_{\text{test}}$ as a function of the relative camera angle, illustrating the effect of viewpoint variation on detection performance.
  • Figure 5: Predictions on an $A_{\text{test}}$ image captured at $60^\circ$ relative to the top-down view. Green bounding boxes mark correct detections made only by the fine-tuned model; blue boxes mark correct detections made by both the fine-tuned and baseline models.