Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)

Zane Durante, Robathan Harries, Edward Vendrow, Zelun Luo, Yuta Kyuragi, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli

TL;DR

Name Tuning can be combined with existing prompt tuning strategies to learn the entire input text (rather than only the prompt or only the class names), which improves few-shot classification performance on InteractADL and 4 other fine-grained visual classification benchmarks.

Abstract

Understanding Activities of Daily Living (ADLs) is a crucial step for different applications including assistive robots, smart homes, and healthcare. However, to date, few benchmarks and methods have focused on complex ADLs, especially those involving multi-person interactions in home environments. In this paper, we propose a new dataset and benchmark, InteractADL, for understanding complex ADLs that involve interaction between humans (and objects). Furthermore, complex ADLs occurring in home environments comprise a challenging long-tailed distribution due to the rarity of multi-person interactions, and pose fine-grained visual recognition tasks due to the presence of semantically and visually similar classes. To address these issues, we propose a novel method for fine-grained few-shot video classification called Name Tuning that enables greater semantic separability by learning optimal class name vectors. We show that Name Tuning can be combined with existing prompt tuning strategies to learn the entire input text (rather than only learning the prompt or class names) and demonstrate improved performance for few-shot classification on InteractADL and 4 other fine-grained visual classification benchmarks. For transparency and reproducibility, we release our code at https://github.com/zanedurante/vlm_benchmark.

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, and 14 tables.

Figures (7)

  • Figure 1: Our fine-grained visual recognition dataset InteractADL poses a challenging classification task for Visual Language Models (VLMs). Left: a sample timeline of multiple ADLs in our InteractADL dataset. We show two $3^{\text{rd}}$ person views along with annotated higher-level activity labels, temporal atomic actions, and dense spatiotemporal scene graphs. Right: dual encoder VLMs have a joint video-language embedding space, in which similarity scores between an input video and category names are computed. For these VLMs, we seek to use a small number of training examples to learn better class names that provide greater semantic separation and improved classification performance.
  • Figure 2: Comparison of input-text optimization methods for dual encoder VLMs. We contrast the methods presented in this work, Name Tuning and CoNa, with standard prompt engineering and prompt tuning (CoOp). We introduce learnable offset vectors to fine-tune class names in the input text.
  • Figure 3: We detail the floorplan of the home environment used to collect data for InteractADL. InteractADL is recorded in a real home environment with sensors (represented in red) placed throughout the many rooms to provide a subset of views for each long-term video, including ego-view, $3^{\text{rd}}$ person, and top-down. Ego-view is provided for all videos, and at least two $3^{\text{rd}}$ person views (including ceiling views) are provided for each video.
  • Figure 4: We show three example scene graph annotations from InteractADL corresponding to three separate rooms in the home. Our annotations cover the primary household objects and actors in the scene and interactions relevant for ADLs.
  • Figure 5: Name Tuning performance on the MOMA Sub-activities dataset using various prompts and CLIP as a pre-trained backbone. We label the prompt used alongside Name Tuning (NT) in the legend and show linear probe and CoOp as baselines.
  • ...and 2 more figures
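
The captions for Figures 1 and 2 describe two ideas that a small example may make more concrete: a dual encoder VLM scores a video against each class name by similarity in a shared embedding space, and Name Tuning adds learnable offset vectors to the class-name part of the input text. The sketch below is only an illustration under strong assumptions and is not the authors' implementation: the "text encoder" is a hypothetical mean-pooling stand-in, the embeddings are 2-D toy vectors, the class names are made up, and the offsets are hand-picked rather than learned from few-shot examples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def encode_text(token_vectors):
    """Hypothetical stand-in for a frozen text encoder: mean-pool the tokens."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(tok[i] for tok in token_vectors) / count for i in range(dim)]


def text_embedding(prompt_tokens, name_tokens, name_offsets):
    """Add one offset vector to each class-name token, then encode prompt + name."""
    tuned_name = [
        [x + d for x, d in zip(tok, off)]
        for tok, off in zip(name_tokens, name_offsets)
    ]
    return encode_text(prompt_tokens + tuned_name)


def score_classes(video_embedding, prompt_tokens, class_name_tokens, class_offsets):
    """Similarity between one video embedding and every class's text embedding."""
    return {
        name: cosine(
            video_embedding,
            text_embedding(prompt_tokens, tokens, class_offsets[name]),
        )
        for name, tokens in class_name_tokens.items()
    }


# Toy setup (all values are assumptions chosen for illustration).
prompt_tokens = [[0.5, 0.5]]                 # fixed prompt token(s)
class_name_tokens = {
    "cooking": [[1.0, 0.0]],                 # two visually similar classes
    "washing": [[0.9, 0.1]],
}
video_embedding = [0.6, 0.4]                 # a video whose true class is "washing"

# Zero offsets reproduce the original, untuned class names.
zero_offsets = {"cooking": [[0.0, 0.0]], "washing": [[0.0, 0.0]]}
# Hand-picked offsets standing in for values a few-shot training loop might learn.
tuned_offsets = {"cooking": [[0.3, -0.3]], "washing": [[-0.3, 0.3]]}

zero_scores = score_classes(video_embedding, prompt_tokens, class_name_tokens, zero_offsets)
tuned_scores = score_classes(video_embedding, prompt_tokens, class_name_tokens, tuned_offsets)

zero_margin = zero_scores["washing"] - zero_scores["cooking"]
tuned_margin = tuned_scores["washing"] - tuned_scores["cooking"]
print("margin with original names:", round(zero_margin, 3))
print("margin with offset names:  ", round(tuned_margin, 3))
```

In this toy setup the offsets widen the similarity gap between the correct class and the confusable one, which is the kind of "greater semantic separation" the Figure 1 caption refers to. In an actual system the offsets would be trainable parameters optimized on the few-shot examples while the encoders stay frozen, and the prompt tokens could be made learnable as well to combine the idea with prompt tuning.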