Temporal-consistent CAMs for Weakly Supervised Video Segmentation in Waste Sorting
Andrea Marelli, Luca Magri, Federica Arrigoni, Giacomo Boracchi
TL;DR
The paper tackles weakly supervised video segmentation for industrial waste sorting by learning temporally coherent CAMs. It introduces a dual-camera before/after setup, background removal to reduce bias, and a reconstruction loss that aligns saliency maps across adjacent frames via optical-flow warping, incorporating spatial coherence from PuzzleCAM. The method yields state-of-the-art segmentation on SERUSO and demonstrates that enforcing temporal coherence during training significantly improves CAM quality and consistency, with favorable classification results when background is removed. Practically, this approach enables accurate, temporally stable segmentation without pixel-level annotations, offering a scalable solution for automated waste sorting and other industrial processes.
Abstract
In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator, who manually removes specific wastes from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those in adjacent frames, after having compensated object displacement. Experiments on a real-world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.
