Multi-task Learning For Joint Action and Gesture Recognition

Konstantinos Spathis, Nikolaos Kardaris, Petros Maragos

TL;DR

The paper addresses the need for efficient joint recognition of human actions and gestures in human–robot interaction by proposing a multi-task learning (MTL) framework that shares representations across both tasks. It systematically analyzes weight-sharing strategies (hard sharing, soft sharing with cross-stitch units, and learned weight sharing, LWS) and multi-task loss formulations, introducing a layer-to-task weight assignment for LWS based on a natural evolution strategy (NES). Across four composite action recognition (AR) and gesture recognition (GR) datasets, most MTL configurations outperform the single-task baselines, with cross-stitch soft sharing and learned weight sharing delivering the strongest gains, particularly when action samples are more prevalent. The findings indicate that MTL improves generalization and efficiency for action and gesture recognition, while also highlighting the impact of dataset composition and the need for a task label on each input sample; future work aims to develop a task-agnostic network for even tighter task integration.

Abstract

In practical applications, computer vision tasks often need to be addressed simultaneously. Multi-task learning typically achieves this by jointly training a single deep neural network to learn shared representations, providing efficiency and improving generalization. Although action and gesture recognition are closely related tasks, since both focus on body and hand movements, current state-of-the-art methods handle them separately. In this paper, we show that employing a multi-task learning paradigm for action and gesture recognition results in more efficient, robust and generalizable visual representations by leveraging the synergies between these tasks. Extensive experiments on multiple action and gesture datasets demonstrate that handling actions and gestures in a single architecture can achieve better performance on both tasks than their single-task learning variants.

Paper Structure

This paper contains 12 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: An instance of a hard parameter sharing model for three tasks. The first layers of the model (gray color) are common for all the tasks, while the last layers (yellow color) are task-specific.
  • Figure 2: Cross-stitch units applied to two task-specific CNNs. The task-specific models, shown in gray, are connected by cross-stitch units, shown in yellow, which control the information shared between the two tasks. The term MTL Layer denotes the set of layers, one from each task-specific network, at a given depth of the model.
  • Figure 3: (a) Representation of an LWS architecture for two tasks. Layers between task-specific networks are compatible for all tasks, so the model learns which parameters will be used in each layer for each task. (b) During training, the model searches for the optimal assignment of weights of layers per task-specific network and updates the probability that certain layer weights are used by a specific task. When the same set of weights in a layer is used for training on both tasks, information sharing is achieved. During inference, the most probable assignment of weights in each layer is used for each task-specific network.
  • Figure 4: Samples of action and gesture benchmark datasets. From top to bottom row: UCF-101, NTU-RGB+D, IsoGD, NVGesture.
  • Figure 5: Class and sample distribution across the different multi-task sets of action and gesture classes. The inner circle represents the classes of each set and the outer circle represents the total number of samples used. (a) Set-1 is constructed from the UCF-101 and IsoGD datasets. (b) Set-2 has samples from UCF-101 and NVGesture. (c) Set-3 consists of action samples from the NTU_AR set and gesture samples from NTU_GR_IsoGD. (d) Set-4 is made from samples from NTU_AR and NTU_GR_NVGesture.
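
The three weight-sharing schemes described in the captions of Figures 1 to 3 can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`linear`, `hard_sharing_forward`, `cross_stitch`, `lws_inference_assignment`) are hypothetical, toy bias-free linear layers stand in for the CNN layers, and the mixing coefficients and assignment probabilities are made-up values that would normally be learned. The sketch only covers the forward pass and the inference-time selection step for LWS, not the training procedure.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Toy bias-free linear layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def hard_sharing_forward(
    shared: Matrix, heads: Dict[str, Matrix], x: Sequence[float]
) -> Dict[str, Vector]:
    """Hard sharing (Figure 1): a common trunk followed by task-specific heads."""
    features = linear(shared, x)  # computed once, reused by every task
    return {task: linear(head, features) for task, head in heads.items()}


def cross_stitch(
    x_a: Sequence[float], x_b: Sequence[float], alpha: Matrix
) -> Tuple[Vector, Vector]:
    """Soft sharing (Figure 2): mix the activations of two task networks.

    `alpha` is a 2x2 matrix; each output is a linear combination of both
    inputs, so the off-diagonal entries control how much is shared.
    """
    (a_aa, a_ab), (a_ba, a_bb) = alpha
    out_a = [a_aa * u + a_ab * v for u, v in zip(x_a, x_b)]
    out_b = [a_ba * u + a_bb * v for u, v in zip(x_a, x_b)]
    return out_a, out_b


def lws_inference_assignment(
    probs: Dict[str, List[List[float]]]
) -> Dict[str, List[int]]:
    """LWS at inference (Figure 3b): pick the most probable weight set.

    probs[task][layer] holds probabilities over candidate weight sets;
    the result lists, per task, the chosen weight-set index for each layer.
    """
    return {
        task: [max(range(len(p)), key=p.__getitem__) for p in layers]
        for task, layers in probs.items()
    }


# Usage with illustrative values only.
outputs = hard_sharing_forward(
    shared=[[1.0, 0.0], [0.0, 1.0]],
    heads={"action": [[1.0, 1.0]], "gesture": [[1.0, -1.0]]},
    x=[2.0, 3.0],
)
mixed_a, mixed_b = cross_stitch(
    [1.0, 2.0], [3.0, 4.0], alpha=[[0.5, 0.5], [0.0, 1.0]]
)
assignment = lws_inference_assignment(
    {
        "action": [[0.7, 0.3], [0.4, 0.6]],
        "gesture": [[0.8, 0.2], [0.1, 0.9]],
    }
)
# Layers where both tasks select the same index share their weights.
shared_layers = [
    i
    for i, (a, g) in enumerate(zip(assignment["action"], assignment["gesture"]))
    if a == g
]
```

In this toy run both tasks pick the same weight set in every layer, which corresponds to full sharing; with different probabilities the tasks would diverge in some layers and keep task-specific weights there.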