Coarse or Fine? Recognising Action End States without Labels

Davide Moltisanti, Hakan Bilen, Laura Sevilla-Lara, Frank Keller

TL;DR

This work addresses end-state recognition of actions in images, focusing on distinguishing coarse from fine cuts without labeled data. It introduces VOST-AUG, an object-agnostic data synthesis pipeline that turns fewer than a hundred whole-object images into thousands of cut-like samples by splitting the object into Voronoi-based regions and shifting those regions apart. A UNet-based model with an auxiliary segmentation task learns to predict a continuous coarseness measure $c$ from a single image, where $c = \frac{|M_a - M_o|}{\sum (M_a \lor M_o)}$ and $M_a$ and $M_o$ are binary masks. At inference, only the encoder and an MLP are used to predict the end state. Trained on VOST-AUG, the method generalizes to real images and unseen objects and outperforms baselines on the COFICUT and AIR datasets, showing that end-state recognition can be learned from synthetic data despite the domain gap.
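The coarseness measure above can be illustrated with a minimal sketch. It rests on assumptions not stated in the summary: $M_a$ is taken to be the mask of the augmented (cut) object and $M_o$ the mask of the original whole object, and $|M_a - M_o|$ is read as the sum of pixel-wise absolute differences. The function name `coarseness` and the toy masks are hypothetical.

```python
def coarseness(mask_a, mask_o):
    """Ratio of differing pixels to pixels covered by either mask.

    mask_a, mask_o: equally sized 2D lists of 0/1 values.
    Assumption: |M_a - M_o| is the summed pixel-wise absolute difference.
    """
    same_shape = len(mask_a) == len(mask_o) and all(
        len(row_a) == len(row_o) for row_a, row_o in zip(mask_a, mask_o)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")

    diff = 0   # numerator: pixels where the two masks disagree
    union = 0  # denominator: pixels covered by M_a OR M_o
    for row_a, row_o in zip(mask_a, mask_o):
        for a, o in zip(row_a, row_o):
            diff += abs(a - o)
            union += 1 if (a or o) else 0
    if union == 0:
        raise ValueError("both masks are empty")
    return diff / union


# Toy masks (hypothetical): a 2x2 object and a scattered version of it.
original = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
augmented = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
c = coarseness(augmented, original)
```

Under this reading, identical masks give $c = 0$ and fully disjoint masks give $c = 1$, so the value grows as the pieces move further from the original object footprint.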

Abstract

We focus on the problem of recognising the end state of an action in an image, which is critical for understanding what action is performed and in which manner. We study this problem through the task of predicting the coarseness of a cut, i.e., deciding whether an object was cut "coarsely" or "finely". No dataset with these annotated end states is available, so we propose an augmentation method to synthesise training data. We apply this method to cutting actions extracted from an existing action recognition dataset. Our method is object agnostic, i.e., it presupposes the location of the object but not its identity. Starting from fewer than a hundred images of a whole object, we can generate several thousand images simulating visually diverse cuts of different coarseness. We use our synthetic data to train a model based on UNet and test it on real images showing coarsely/finely cut objects. Results demonstrate that the model successfully recognises the end state of the cutting action despite the domain gap between training and testing, and that the model generalises well to unseen objects.

Paper Structure (31 sections, 3 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Summary of our work. We aim to recognise the end state of an action, e.g., whether an object is cut coarsely or finely. We assume no labels and propose an object-agnostic image augmentation method to synthesise training data. Our model successfully learns from this synthetic data, as we show by testing on real images and videos, including for unseen objects.
  • Figure 2: Trying to generate images of coarsely/finely cut objects with InstructPix2Pix (Brooks et al., 2023). Text indicates the prompts used.
  • Figure 3: Our augmentation method to transform whole objects into cut objects. Given an image and a mask segmenting the object, we first remove the object and inpaint the image to fill the resulting hole (image w/o object, bottom left). We then split the object into regions (Step 1). For this we sample $n$ seeding points (nine in this example, indicated by circles) and group object pixels into regions based on their distance to each point, as in a Voronoi diagram. We devise four sampling strategies which affect the topology of the regions and simulate different cut types. We then "break" regions given a reference point (Step 2), shown as a red dot, i.e., we push each region away from the reference point along the line connecting the region and the point. Lastly (Step 3), we overlay the new regions onto the image w/o object to obtain the final augmented image. We show four examples with reference point (centre, middle) and each of the four sampling strategies.
  • Figure 4: Illustrating how the parameters of our augmentation affect the output image. The number of seeding points controls the coarseness of the simulated cut, with fewer/more points corresponding to a coarser/finer cut (left). To obtain more diversified and realistic images we push regions by a random number of pixels sampled within an interval (centre) and add noise to the seeding points (right).
  • Figure 5: Our model to predict the coarseness of a cut. The model adopts a UNet architecture, where the Encoder bottleneck features $z$ are optimised in two ways. We use an MLP to predict coarseness given $z$ with the L1 loss, using $c$ as target. To learn a stronger $z$, the UNet decoder adds an auxiliary segmentation task, where we use the augmented object mask as target. The decoder is used only during training. For inference we employ only the Encoder and the MLP output to predict the coarseness of a test image.
  • ...and 6 more figures
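
The region splitting and breaking described in the captions of Figures 3 and 4 can be sketched on a binary mask as follows. This is a rough illustration, not the authors' implementation: seeding points are sampled uniformly from object pixels (a stand-in for the four sampling strategies, which are not detailed here), each region is pushed along the direction from the reference point to the region's centroid (an assumption about "the line connecting the region and the point"), and Step 3 (overlaying onto the inpainted image) is omitted. The function `split_and_break` and its parameters are hypothetical names.

```python
import math
import random


def split_and_break(mask, n_seeds, ref_point, push_range=(1, 3), rng=None):
    """Split object pixels into Voronoi regions, then push regions away from ref_point.

    mask: 2D list of 0/1 values; ref_point: (row, col), may be fractional.
    Returns (new_mask, regions), where regions is a list of pixel lists.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    height, width = len(mask), len(mask[0])
    object_pixels = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]

    # Step 1: sample seeding points and assign each object pixel to its nearest seed.
    seeds = rng.sample(object_pixels, n_seeds)
    regions = [[] for _ in seeds]
    for r, c in object_pixels:
        nearest = min(
            range(n_seeds),
            key=lambda i: (r - seeds[i][0]) ** 2 + (c - seeds[i][1]) ** 2,
        )
        regions[nearest].append((r, c))

    # Step 2: shift each region away from the reference point by a random
    # number of pixels drawn from push_range (inclusive).
    new_mask = [[0] * width for _ in range(height)]
    for pixels in regions:
        centroid_r = sum(p[0] for p in pixels) / len(pixels)
        centroid_c = sum(p[1] for p in pixels) / len(pixels)
        dr, dc = centroid_r - ref_point[0], centroid_c - ref_point[1]
        norm = math.hypot(dr, dc)
        shift = rng.randint(*push_range)
        off_r = round(shift * dr / norm) if norm else 0
        off_c = round(shift * dc / norm) if norm else 0
        for r, c in pixels:
            nr, nc = r + off_r, c + off_c
            if 0 <= nr < height and 0 <= nc < width:  # drop pixels pushed off the image
                new_mask[nr][nc] = 1
    return new_mask, regions


# Toy example: a 4x4 square object in a 12x12 image, broken from its centre.
mask = [[1 if 4 <= r < 8 and 4 <= c < 8 else 0 for c in range(12)] for r in range(12)]
cut_mask, regions = split_and_break(mask, n_seeds=4, ref_point=(5.5, 5.5))
```

Raising `n_seeds` yields more, smaller regions, which corresponds to the finer cuts described for Figure 4; widening `push_range` spreads the pieces further apart.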