Multi-task Learning For Joint Action and Gesture Recognition
Konstantinos Spathis, Nikolaos Kardaris, Petros Maragos
TL;DR
The paper addresses the need for efficient joint recognition of human actions and gestures in human–robot interaction by proposing a multi-task learning (MTL) framework that shares representations across both tasks. It systematically analyzes three weight-sharing strategies (hard sharing, soft sharing with cross-stitch units, and learned weight sharing, LWS) together with several multi-task loss formulations, and uses NES-based layer-task assignment for LWS. Across four composite datasets that combine action recognition (AR) and gesture recognition (GR) data, most MTL configurations outperform the single-task baselines; cross-stitch-style soft sharing and learned weight sharing deliver the strongest gains, particularly when action samples are more prevalent than gesture samples. The findings indicate that MTL improves generalization and efficiency for action and gesture recognition. They also highlight the impact of dataset composition and a limitation: each input sample must currently be supplied with a label identifying its task. Future work aims to develop a task-agnostic network for even tighter task integration.
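As an illustration of the soft-sharing idea mentioned above, the following minimal sketch shows how a cross-stitch unit mixes the activations of two task branches through a small learnable matrix. It is a sketch under stated assumptions, not the authors' implementation: the function name `cross_stitch`, the branch names, and the fixed 2x2 matrix `alpha` are hypothetical, and in a real network `alpha` would be learned by gradient descent alongside the other weights.

```python
from typing import List, Tuple

Vector = List[float]


def cross_stitch(
    x_action: Vector,
    x_gesture: Vector,
    alpha: List[List[float]],
) -> Tuple[Vector, Vector]:
    """Linearly mix two task-specific activation vectors.

    alpha is a 2x2 matrix (hypothetical fixed values here; learnable in
    practice). Row 0 produces the new action-branch activations and
    row 1 produces the new gesture-branch activations.
    """
    if len(x_action) != len(x_gesture):
        raise ValueError("both branches must have the same feature size")
    (a_aa, a_ag), (a_ga, a_gg) = alpha
    mixed_action = [a_aa * xa + a_ag * xg for xa, xg in zip(x_action, x_gesture)]
    mixed_gesture = [a_ga * xa + a_gg * xg for xa, xg in zip(x_action, x_gesture)]
    return mixed_action, mixed_gesture


# Identity matrix: the branches stay fully separate (no sharing).
separate = cross_stitch([1.0, 2.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])

# Uniform matrix: both branches receive the same averaged features.
shared = cross_stitch([1.0, 2.0], [3.0, 4.0], [[0.5, 0.5], [0.5, 0.5]])
```

The two example matrices mark the extremes between which a learned `alpha` can settle: no exchange of information between the action and gesture branches, or identical features for both.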
Abstract
In practical applications, computer vision tasks often need to be addressed simultaneously. Multi-task learning typically achieves this by jointly training a single deep neural network to learn shared representations, which improves both efficiency and generalization. Although action and gesture recognition are closely related tasks, since both focus on body and hand movements, current state-of-the-art methods handle them separately. In this paper, we show that employing a multi-task learning paradigm for action and gesture recognition yields more efficient, robust and generalizable visual representations by leveraging the synergies between these tasks. Extensive experiments on multiple action and gesture datasets demonstrate that handling actions and gestures in a single architecture can achieve better performance on both tasks than their single-task counterparts.
