Coarse or Fine? Recognising Action End States without Labels
Davide Moltisanti, Hakan Bilen, Laura Sevilla-Lara, Frank Keller
TL;DR
This work addresses end-state recognition of actions in images, focusing on distinguishing coarse from fine cuts without labeled data. It introduces VOST-AUG, an object-agnostic data synthesis pipeline that turns a few whole-object images into thousands of cut-like samples through Voronoi-based region breaking and region shifting, so that a model can be trained entirely on synthetic data. A UNet-based model with an auxiliary segmentation task learns a continuous coarseness measure $c$ from a single image, using $c = \frac{|M_a - M_o|}{\sum (M_a \lor M_o)}$ where $M_a$ and $M_o$ are binary masks, and employs an encoder+MLP at inference for end-state prediction. Trained on VOST-AUG, the method generalizes well to real images and unseen objects, outperforming baselines on the COFICUT and AIR datasets and demonstrating that end-state recognition can be learned from synthetic data despite the domain gap.
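The coarseness measure above can be illustrated with a minimal NumPy sketch. It rests on two assumptions that the summary does not state: the numerator $|M_a - M_o|$ is read as the number of pixels where the two masks differ (that is, summed over the image), and an empty union is mapped to $c = 0$. The function name `coarseness` and the toy masks are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def coarseness(m_a: np.ndarray, m_o: np.ndarray) -> float:
    """Sketch of c = |M_a - M_o| / sum(M_a OR M_o) for two binary masks.

    Assumption: the numerator is the count of pixels where the masks differ.
    For binary masks this equals the size of their symmetric difference (XOR).
    """
    m_a = m_a.astype(bool)
    m_o = m_o.astype(bool)
    union = np.logical_or(m_a, m_o).sum()
    if union == 0:
        # Assumption: two empty masks are treated as having zero coarseness.
        return 0.0
    differing = np.logical_xor(m_a, m_o).sum()
    return float(differing / union)


# Toy example: a 2x2 square and the same square shifted one pixel to the right.
m_o = np.zeros((4, 6), dtype=np.uint8)
m_o[1:3, 1:3] = 1
m_a = np.zeros((4, 6), dtype=np.uint8)
m_a[1:3, 2:4] = 1

c_same = coarseness(m_o, m_o)    # identical masks
c_shift = coarseness(m_a, m_o)   # partially overlapping masks
```

Under this reading, $c$ is 0 for identical masks, 1 for disjoint masks, and equals one minus the intersection-over-union of the two masks in between.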
Abstract
We focus on the problem of recognising the end state of an action in an image, which is critical for understanding what action is performed and in which manner. We study this problem through the task of predicting the coarseness of a cut, i.e., deciding whether an object was cut "coarsely" or "finely". No dataset with these annotated end states is available, so we propose an augmentation method to synthesise training data. We apply this method to cutting actions extracted from an existing action recognition dataset. Our method is object agnostic, i.e., it presupposes the location of the object but not its identity. Starting from fewer than a hundred images of a whole object, we can generate several thousand images simulating visually diverse cuts of different coarseness. We use our synthetic data to train a model based on UNet and test it on real images showing coarsely/finely cut objects. Results demonstrate that the model successfully recognises the end state of the cutting action despite the domain gap between training and testing, and that the model generalises well to unseen objects.
