Multi-Granularity Hand Action Detection

Ting Zhe; Jing Zhang; Yongqian Li; Yong Luo; Han Hu; Dacheng Tao

TL;DR

This paper tackles the challenge of fine-grained hand action detection in kitchen videos by introducing FHA-Kitchens, a large-scale dataset with both coarse and fine-grained hand action labels and precise localization for hand interaction regions. It also presents MG-HAD, an end-to-end DETR-based detector that handles multi-granularity action information via Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising, improving performance on both coarse and fine-grained labels. The authors provide extensive dataset statistics, a three-track benchmark (SL-AD, SL-AR, DG), and thorough ablations showing the effectiveness of the proposed designs. Overall, the work establishes a valuable new dataset and a strong baseline for multi-granularity hand action detection with potential impact on HCI, robotics, and video understanding tasks.

Abstract

Detecting hand actions in videos is crucial for understanding video content and has diverse real-world applications. Existing approaches often focus on whole-body actions or coarse-grained action categories, lacking fine-grained hand-action localization information. To fill this gap, we introduce the FHA-Kitchens (Fine-Grained Hand Actions in Kitchen Scenes) dataset, providing both coarse- and fine-grained hand action categories along with localization annotations. This dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories. Evaluation of existing action detection methods on FHA-Kitchens reveals varying generalization capabilities across different granularities. To handle multi-granularity in hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. It incorporates two new designs: Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising. Extensive experiments demonstrate MG-HAD's effectiveness for multi-granularity hand action detection, highlighting the significance of FHA-Kitchens for future research and real-world applications. The dataset and source code are available at https://github.com/superZ678/MG-HAD.
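
As a rough sanity check on the dataset scale quoted above, the headline figures can be turned into per-clip and per-frame averages. The sketch below uses only the numbers stated in the abstract; the constant names and the `per_unit` helper are illustrative, and the box count is the abstract's approximate "200k", so the per-frame density is an estimate rather than an exact statistic.

```python
# Figures quoted in the abstract. The box count is approximate
# ("approximately 200k"), so results derived from it are estimates.
NUM_CLIPS = 2_377
NUM_FRAMES = 30_047
APPROX_NUM_BOXES = 200_000


def per_unit(total: float, units: int) -> float:
    """Return the average of `total` spread evenly over `units`."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


# Average number of annotated frames per video clip.
frames_per_clip = per_unit(NUM_FRAMES, NUM_CLIPS)

# Approximate number of bounding boxes per annotated frame.
boxes_per_frame = per_unit(APPROX_NUM_BOXES, NUM_FRAMES)

print(f"frames per clip: {frames_per_clip:.1f}")
print(f"boxes per frame (approx.): {boxes_per_frame:.1f}")
```

Under these figures, each clip contributes roughly a dozen annotated frames, and each frame carries several boxes on average.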

Paper Structure (35 sections, 5 equations, 18 figures, 15 tables)

Introduction
Related work
AR & AD Dataset
AR & AD Method
FHA-Kitchens Dataset
Data Collection And Organization
Data Annotation
Statistics of the FHA-Kitchens Dataset
Benchmark Setup
A Simple Yet Strong Baseline
A Multi-Granularity Framework
Multi-Dimensional Action Queries
Coarse-Fine Contrastive Denoising
Experiments
Experiments Settings
...and 20 more sections

Figures (18)

Figure 1: Overview of the FHA-Kitchens dataset. (a) The annotation of hand actions in existing relevant datasets, where UCF101 and Kinetics700 are whole-body action datasets, while MPII Cooking and EPIC KITCHENS are hand action datasets. (b) The annotation of hand actions in our dataset. The left shows some frames extracted from 8 dish categories. The right illustrates the annotation process of hand actions in "fry vegetable".
Figure 2: An overview of the action verbs and their parent action categories in FHA-Kitchens.
Figure 3: The distribution of instances per action verb category (the outer ring in Figure 2) in the FHA-Kitchens dataset.
Figure 4: The distribution of instances per object noun category from 17 super-categories in the FHA-Kitchens dataset.
Figure 5: The overall architecture of MG-HAD, a novel end-to-end hand action detection model based on DINO. The improvements mainly focus on the decoder. Specifically, (1) we introduce a new design for the content queries, transforming the original single-dimensional content queries into multi-dimensional ones. These are further processed by the designed CQR module, combined with initialized anchors, and fed into the decoder. The three output query sets, each with a different action dimension, pass through the Split & Integration module to generate $N$ queries containing three action dimensions. Finally, matching is performed to predict hand action results (see Section 4.2); (2) we introduce a C-F CDN training approach, which adds coarse- and fine-grained noise to labels to generate four types of CDN queries for contrastive denoising training (see Section 4.3). F: Fine-grained, C: Coarse-grained, Multi-G: Multi-granularity.
...and 13 more figures
