Temporal-consistent CAMs for Weakly Supervised Video Segmentation in Waste Sorting

Andrea Marelli, Luca Magri, Federica Arrigoni, Giacomo Boracchi

TL;DR

The paper tackles weakly supervised video segmentation for industrial waste sorting by learning temporally coherent CAMs. It introduces a dual-camera before/after setup, background removal to reduce bias, and a reconstruction loss that aligns saliency maps across adjacent frames via optical-flow warping, incorporating spatial coherence from PuzzleCAM. The method yields state-of-the-art segmentation on SERUSO and demonstrates that enforcing temporal coherence during training significantly improves CAM quality and consistency, with favorable classification results when background is removed. Practically, this approach enables accurate, temporally stable segmentation without pixel-level annotations, offering a scalable solution for automated waste sorting and other industrial processes.

Abstract

In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator manually removes specific waste items from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those of adjacent frames, after compensating for object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
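
The temporal idea in the abstract (compare the central saliency map with adjacent maps after compensating for object displacement) can be illustrated with a small sketch. This is not the authors' implementation: the function names `warp_nearest` and `temporal_loss`, the nearest-neighbour warp, the plain averaging used to fuse the two warped maps, and the L1 distance are all assumptions made for illustration, and the sketch warps saliency maps directly rather than network features. The optical flow is taken as given.

```python
from typing import List, Tuple

Map = List[List[float]]


def warp_nearest(src: Map, flow_x: Map, flow_y: Map) -> Map:
    """Backward-warp `src` (an adjacent-frame map) into the central frame.

    For each central-frame pixel (y, x), the flow gives the displacement to
    the matching location in the adjacent frame. Displacements are rounded to
    the nearest pixel; lookups that fall outside the map yield 0.0.
    """
    h, w = len(src), len(src[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = x + int(round(flow_x[y][x]))
            sy = y + int(round(flow_y[y][x]))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = src[sy][sx]
    return out


def temporal_loss(
    m_prev: Map,
    m_t: Map,
    m_next: Map,
    flow_to_prev: Tuple[Map, Map],
    flow_to_next: Tuple[Map, Map],
) -> float:
    """Mean absolute difference between the central map and a fused map.

    The fused map is the average of the two motion-compensated adjacent maps
    (an assumed fusion rule, chosen only to keep the sketch short).
    """
    warped_prev = warp_nearest(m_prev, *flow_to_prev)
    warped_next = warp_nearest(m_next, *flow_to_next)
    h, w = len(m_t), len(m_t[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            fused = 0.5 * (warped_prev[y][x] + warped_next[y][x])
            total += abs(fused - m_t[y][x])
    return total / (h * w)


# Toy example: an object moving one pixel to the right per frame.
m_prev = [[0.0, 1.0, 0.0, 0.0, 0.0]]
m_t = [[0.0, 0.0, 1.0, 0.0, 0.0]]
m_next = [[0.0, 0.0, 0.0, 1.0, 0.0]]
zeros = [[0.0] * 5]
flow_to_prev = ([[-1.0] * 5], zeros)  # look one pixel to the left
flow_to_next = ([[1.0] * 5], zeros)   # look one pixel to the right

aligned = temporal_loss(m_prev, m_t, m_next, flow_to_prev, flow_to_next)
unaligned = temporal_loss(m_prev, m_t, m_next, (zeros, zeros), (zeros, zeros))
```

In this toy case the motion-compensated maps coincide with the central map, so `aligned` is zero, while skipping the compensation leaves a positive residual in `unaligned`. A loss of this kind only rewards consistent saliency if the displacement is estimated well, which is why the compensation step matters.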

Paper Structure

This paper contains 21 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Two cameras, $C_1$ and $C_2$, are placed along a conveyor belt where a human operator manually removes illegal objects. Camera $C_1$ captures the belt section before the operator’s intervention, while Camera $C_2$ captures the section after, where only legal objects remain. Given a "before" image, our goal is to accurately segment objects into two categories: legal objects and illegal objects that should be removed.
  • Figure 2: Problem formulation: (a) an RGB input image $X$ is processed to generate an accurate output mask $M_X$. This mask classifies each pixel as illegal (red) or background (blue). (b) Training set comprising "before" and "after" videos. "Before" videos capture the conveyor belt before human intervention. "After" videos capture the belt after non-colored PET objects have been removed.
  • Figure 3: (a) Comparison of images with background, without background and the extracted background itself, which is used to generate a third independent class, respectively. (b) By shifting from the $\Lambda$ class domain to the $\hat{\Lambda}$ class domain, we can not only distinguish between illegal (red) and background (blue) elements, but also segment legal (green) objects with a new, more specific, label, distinguishing them from the empty belt regions.
  • Figure 4: Main pipeline illustration. The overall workflow of our network, which processes a triplet of frames ($X_{t-1}$, $X_t$, $X_{t+1}$). The spatial module (PuzzleCAM \cite{jo2021puzzle}) outputs a reconstructed feature space $f_t^{\text{puzzle}}$ which is pushed to match the original feature space $f_t$ by $\mathcal{L}_{\text{spatial}}$. The temporal module outputs a new saliency map $M_t^{\text{fused}}$ for the central frame $X_t$, obtained from the features of the adjacent frames $X_{t+1}$ and $X_{t-1}$. $M_t^{\text{fused}}$ is then pushed to match the original map $M_t$ by the reconstruction loss $\mathcal{L}_{\text{temporal}}$. $\mathcal{L}_{\text{cls}}$ and $\mathcal{L}_{\text{p-cls}}$ are instead the classification losses. The computation of the four losses of the network is described in Sec. \ref{sec:losses}, while spatial and temporal modules are detailed in Fig. \ref{fig:spatial_module} and \ref{fig:temporal_module}, respectively.
  • Figure 5: Spatial Module: The central frame $X_t$ is divided into non-overlapping patches by the tiling module, and for each patch, we extract its feature maps. These sub-feature maps are then re-merged to create a single reconstructed feature space that is compared with the one of the original image $X_t$ through the reconstruction loss $\mathcal{L}_{\text{spatial}}$. This module is the implementation of PuzzleCAM and it aims to improve segmentation by focusing on the spatial arrangement of objects within a single frame.
  • ...and 2 more figures
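
The tile-and-merge step described in the Figure 5 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the PuzzleCAM code or the authors' network: `tile`, `merge`, `spatial_loss`, and the two stand-in feature functions are hypothetical names, the "feature extractor" keeps the spatial size of its input, the frame dimensions are assumed divisible by the tile grid, and an L1 distance stands in for $\mathcal{L}_{\text{spatial}}$.

```python
from typing import Callable, List

Grid = List[List[float]]


def tile(x: Grid, rows: int = 2, cols: int = 2) -> List[List[Grid]]:
    """Split `x` into rows x cols non-overlapping patches."""
    ph, pw = len(x) // rows, len(x[0]) // cols
    return [
        [[line[c * pw:(c + 1) * pw] for line in x[r * ph:(r + 1) * ph]]
         for c in range(cols)]
        for r in range(rows)
    ]


def merge(tiles: List[List[Grid]]) -> Grid:
    """Re-assemble a grid of patches into a single map."""
    merged: Grid = []
    for tile_row in tiles:
        for i in range(len(tile_row[0])):
            merged.append([v for patch in tile_row for v in patch[i]])
    return merged


def spatial_loss(x: Grid, features: Callable[[Grid], Grid],
                 rows: int = 2, cols: int = 2) -> float:
    """Mean absolute difference between full-image and re-merged patch features."""
    full = features(x)
    patched = merge([[features(p) for p in tile_row]
                     for tile_row in tile(x, rows, cols)])
    h, w = len(full), len(full[0])
    return sum(abs(full[i][j] - patched[i][j])
               for i in range(h) for j in range(w)) / (h * w)


def pointwise(g: Grid) -> Grid:
    """Stand-in extractor that ignores context: each value is just doubled."""
    return [[2.0 * v for v in line] for line in g]


def centered(g: Grid) -> Grid:
    """Stand-in extractor that depends on context: subtracts the input mean."""
    mean = sum(sum(line) for line in g) / (len(g) * len(g[0]))
    return [[v - mean for v in line] for line in g]


frame = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]

loss_pointwise = spatial_loss(frame, pointwise)
loss_centered = spatial_loss(frame, centered)
```

A context-free extractor gives identical features for the full frame and for its patches, so `loss_pointwise` is zero. An extractor whose output depends on what else is visible gives different results once the frame is cut up, so `loss_centered` is positive. Penalising that gap is what pushes patch-level and image-level features to agree.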