Automated Counting of Stacked Objects in Industrial Inspection

Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu, Hieu Le, Pascal Fua

Abstract

Visual object counting is a fundamental computer vision task in industrial inspection, where accurate, high-throughput inventory tracking and quality assurance are critical. Moreover, manufactured parts are often too light to reliably deduce their count from their weight, or too heavy to move the stack on a scale safely and practically, making automated visual counting the more robust solution in many scenarios. However, existing methods struggle with stacked 3D items in containers, pallets, or bins, where most objects are heavily occluded and only a few are directly visible. To address this important yet underexplored challenge, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems: estimating the 3D geometry of the stack and its occupancy ratio from multi-view images. By combining geometric reconstruction with deep learning-based depth analysis, our method can accurately count identical manufactured parts inside containers, even when they are irregularly stacked and partially hidden. We validate our 3D counting pipeline on large-scale synthetic and diverse real-world data with manually verified total counts, demonstrating robust performance under realistic inspection conditions.

Paper Structure (25 sections, 9 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: 3D Counting (3DC). From multiple views of the objects to be counted, we estimate both the total volume occupied by the stack and the fraction of this volume taken up by the objects. Combining these estimates yields the total number of objects.
  • Figure 2: 3DC pipeline. We decompose the counting task into estimating the volume of the objects to be counted and then estimating the occupancy ratio within that volume. The first is done on the basis of geometry reconstructed from segmentations in multiple images. The second takes as input a depth map computed by a monocular depth estimator and regresses an occupancy ratio from it.
  • Figure 3: Shape-dependent volume occupancy. We show a slice of the stacked objects for varying shapes. Different shapes yield different configurations and the fraction of space occupied by the gaps between objects varies. This is reflected in the images, where deeper layers remain visible for low $\gamma$ values.
  • Figure 4: Dataset samples. We visualize generated scenes in ascending order of occupancy ratio, with ground-truth depth maps.
  • Figure 5: Reducing the domain gap. Instead of estimating the occupancy ratio $\gamma$ directly from synthetic (top) and real (bottom) images (a), we identify a key view (b) and train a network to predict $\gamma$ from its depth map (c). The synthetic and real depth maps are indistinguishable. Top row: synthetic, $\gamma_{gt} = 62.4\%$. Bottom row: chocolates, $\gamma_{est} = 53.5\%$, $\mathcal{N}_{est} = 119$, $\mathcal{N}_{gt} = 131$.
  • ...and 9 more figures
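
The captions of Figures 1 and 2 describe combining a stack volume estimate with an occupancy ratio $\gamma$ to obtain a count. A minimal sketch of that final combination step is given below, assuming the volume of a single object is known in advance. The function name `estimate_count`, the parameter `unit_volume`, and the rounding rule are illustrative assumptions, not details taken from the paper, and the input values in the usage example are hypothetical.

```python
def estimate_count(stack_volume: float, occupancy_ratio: float, unit_volume: float) -> int:
    """Combine a stack volume and an occupancy ratio into an object count.

    Assumption (not from the paper): the count is the occupied volume
    divided by the known volume of one object, rounded to the nearest integer.
    All volumes must use the same unit (e.g. cubic centimetres).
    """
    if stack_volume < 0:
        raise ValueError("stack_volume must be non-negative")
    if unit_volume <= 0:
        raise ValueError("unit_volume must be positive")
    if not 0.0 <= occupancy_ratio <= 1.0:
        raise ValueError("occupancy_ratio must lie in [0, 1]")

    # Volume actually filled by objects, excluding the gaps between them.
    occupied_volume = occupancy_ratio * stack_volume
    return round(occupied_volume / unit_volume)


# Hypothetical inputs: a 1000 cm^3 stack, half of it filled, 10 cm^3 per object.
print(estimate_count(stack_volume=1000.0, occupancy_ratio=0.5, unit_volume=10.0))  # prints 50
```

In this sketch, any relative error in either the volume or the occupancy estimate carries over proportionally to the count, which is one way to read why the pipeline treats the two as separate subproblems.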