FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

Eadom Dessalene, Botao He, Michael Maynord, Yonatan Tussa, Pavan Mantripragada, Yianni Karabati, Nirupam Roy, Yiannis Aloimonos

Abstract

We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural, unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding by applying FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any manual contacted object segmentation annotations. Furthermore, we demonstrate that action representation learning with FEEL, without any manual labels, improves transfer performance on action understanding tasks across EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

Paper Structure (10 sections, 7 figures, 4 tables)


Figures (7)

  • Figure 1: FEEL pairs egocentric video with force measurements to capture the physical causes—not just visual effects—of hand-object interaction.
  • Figure 2: Forces reveal interaction dynamics. Representative sequences showing how force measurements disambiguate physical interactions. Left: Grasping and lifting a pan exhibits distinct force signatures with an onset at the end of the reach and a spike as the pan is lifted. Right: Stirring produces rhythmic force patterns reflecting cyclic manipulation.
  • Figure 3: Contact detection from force measurements. Despite visual similarity across frames, force measurements precisely identify contact boundaries. Top: Raw sensor forces (pinky and middle finger dominate this grasp). Middle: Consolidated force with dual thresholds—above C threshold indicates contact, below NC threshold indicates non-contact, between is ambiguous (excluded). Forces enable scalable contact supervision without manual annotation of ambiguous visual transitions.
  • Figure 4: Force-sensing glove hardware. Leftmost: Palm-facing views of the left and right gloves, showing the six piezoresistive sensors (red) mounted at each fingertip and across the palm, the Arduino microcontroller case (blue) mounted at the wrist, and the on/off switch (green). Middle and right: Side and in-use views demonstrating the glove's low-profile form factor during natural object manipulation.
  • Figure 5: Learning contacts and action from forces. (a) Contact Understanding: Image network trained with force-derived contact labels performs contact detection and segmentation (red mask) for each hand (L/R). (b) Action Representation Learning: Video network predicts per-hand forces from clips for pretraining. The video backbone is transferred to action recognition after discarding force heads. No manual contact or action labels required during pretraining.
  • ...and 2 more figures
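
The dual-threshold labeling described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the threshold values, and the choice to consolidate the six sensor readings by summing them are all assumptions made for illustration, since the caption does not specify how the consolidated force is computed.

```python
from typing import List, Optional, Sequence


def consolidate(sensor_forces: Sequence[float]) -> float:
    """Collapse per-sensor readings into one value.

    Assumption: a plain sum. The source does not state the consolidation rule.
    """
    return sum(sensor_forces)


def label_contact(
    force: float, c_threshold: float, nc_threshold: float
) -> Optional[bool]:
    """Return True (contact), False (non-contact), or None (ambiguous).

    Follows the scheme in the Figure 3 caption: above the C threshold is
    contact, below the NC threshold is non-contact, and anything in between
    is ambiguous and should be excluded from supervision.
    """
    if nc_threshold > c_threshold:
        raise ValueError("nc_threshold must not exceed c_threshold")
    if force > c_threshold:
        return True
    if force < nc_threshold:
        return False
    return None


def label_frames(
    frames: Sequence[Sequence[float]], c_threshold: float, nc_threshold: float
) -> List[Optional[bool]]:
    """Label each frame from its six per-sensor force readings."""
    return [
        label_contact(consolidate(f), c_threshold, nc_threshold) for f in frames
    ]


if __name__ == "__main__":
    # Hypothetical readings for three frames (arbitrary units).
    frames = [
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # hand at rest
        [0.1, 0.0, 0.0, 0.2, 0.0, 0.3],  # transition, falls between thresholds
        [1.0, 0.5, 0.2, 0.8, 0.1, 0.4],  # firm grasp
    ]
    labels = label_frames(frames, c_threshold=2.0, nc_threshold=0.2)
    # Keep only frames with an unambiguous label for training.
    usable = [(i, lab) for i, lab in enumerate(labels) if lab is not None]
    print(labels, usable)
```

Returning `None` for the band between the two thresholds keeps ambiguous transition frames out of the supervision signal, which matches the caption's note that they are excluded.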