An Evolutionary Network Architecture Search Framework with Adaptive Multimodal Fusion for Hand Gesture Recognition

Yizhang Xia, Shihao Song, Zhanglu Hou, Junwen Xu, Juan Zou, Yuan Liu, Shengxiang Yang

TL;DR

The paper tackles the labor-intensive challenge of manually designing multimodal hand gesture recognition networks by introducing AMF-ENAS, an evolutionary neural architecture search framework that automatically optimizes both where to fuse modalities and how much each modality contributes. It uses a block-based encoding space to efficiently search architectures and a two-stage search (rough on a combined dataset followed by transfer on sub-datasets) to tailor models per dataset, complemented by a novel fusion strategy and an encoding scheme for multimodal data. Empirical results on Ninapro DB2, DB3, and DB7 show state-of-the-art accuracy, with AMF-ENAS significantly outperforming manually designed networks and prior ENAS variants, highlighting the importance of fusion-ratio optimization. The work demonstrates a practical, dataset-adaptive approach to multimodal gesture recognition that reduces manual design effort while delivering robust performance across diverse datasets, and suggests extending the framework to incorporate additional modalities in future work.

Abstract

Hand gesture recognition (HGR) based on multimodal data has attracted considerable attention owing to its great potential in applications. Various manually designed multimodal deep networks have performed well in multimodal HGR (MHGR), but most existing algorithms require extensive expert experience and time-consuming manual trials. To address these issues, we propose an evolutionary network architecture search framework with adaptive multimodal fusion (AMF-ENAS). Specifically, we design an encoding space that simultaneously considers the fusion positions and fusion ratios of the multimodal data, allowing multimodal networks with different architectures to be constructed automatically through decoding. Additionally, we consider three input streams corresponding to intra-modal surface electromyography (sEMG), intra-modal accelerometer (ACC), and inter-modal sEMG-ACC. To adapt to various datasets, the ENAS framework automatically searches for an MHGR network with appropriate fusion positions and ratios. To the best of our knowledge, this is the first time that ENAS has been utilized in MHGR to tackle issues related to the fusion position and ratio of multimodal data. Experimental results demonstrate that AMF-ENAS achieves state-of-the-art performance on the Ninapro DB2, DB3, and DB7 datasets.
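
To make the idea of an encoding that carries both a fusion position and a fusion ratio more concrete, the following minimal Python sketch may help. It is not the authors' encoding: the `Genome` fields, the `decode` function, the interpretation of the fusion ratio as a channel split between sEMG and ACC features, and the default of 64 fused channels are all assumptions made for illustration. Only the count of six block types is taken from the paper (Figure 5).

```python
import random
from dataclasses import dataclass

NUM_BLOCK_TYPES = 6  # Figure 5 of the paper lists six block structures


@dataclass(frozen=True)
class Genome:
    """Hypothetical fixed-length encoding of one candidate network."""
    semg_blocks: tuple    # block-type id per layer of the sEMG stream
    acc_blocks: tuple     # block-type id per layer of the ACC stream
    fusion_position: int  # index of the block after which streams are fused
    fusion_ratio: float   # assumed meaning: share of fused channels from sEMG


def random_genome(rng, depth=4):
    """Sample a random genome; depth and ratio choices are illustrative."""
    return Genome(
        semg_blocks=tuple(rng.randrange(NUM_BLOCK_TYPES) for _ in range(depth)),
        acc_blocks=tuple(rng.randrange(NUM_BLOCK_TYPES) for _ in range(depth)),
        fusion_position=rng.randrange(depth),
        fusion_ratio=rng.choice((0.25, 0.5, 0.75)),
    )


def decode(genome, fused_channels=64):
    """Turn a genome into a plain build plan (no deep-learning library needed)."""
    semg_channels = round(fused_channels * genome.fusion_ratio)
    return {
        "fuse_after_block": genome.fusion_position,
        "semg_channels": semg_channels,
        "acc_channels": fused_channels - semg_channels,
        "semg_blocks": list(genome.semg_blocks),
        "acc_blocks": list(genome.acc_blocks),
    }


if __name__ == "__main__":
    print(decode(random_genome(random.Random(0))))
```

In this sketch the fused stream stands in for the inter-modal sEMG-ACC input, and the search only has to vary two extra genes (position and ratio) to explore different fusion designs. The encoding actually used by AMF-ENAS may be organized differently.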

Paper Structure (20 sections, 1 equation, 6 figures, 4 tables, 1 algorithm)

This paper contains 20 sections, 1 equation, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the proposed framework for an adaptive multimodal evolutionary network architecture search. (a) Search for a suitable multimodal deep network architecture. (b) Train the searched network. (c) Test the final performance after fine-tuning the trained network. “Block1” and “Block3” represent fixed block types, while “Blockn” represents an undecided block type.
  • Figure 2: Illustration of the proposed fixed-length encoding scheme, in which each segment of the encoding corresponds to a specific functionality.
  • Figure 3: The process of producing offspring through crossover in the evolutionary process.
  • Figure 4: Illustration of the process of mutation during the evolutionary process.
  • Figure 5: The six different block structures, with an illustration of the internal structure of each block. (a) Residual block based on ordinary convolution. (b) Residual block based on local convolution. (c) Channel-Attention block. (d) Ordinary convolution block. (e) Local convolution block. (f) Spatial-Attention block.
  • ...and 1 more figure
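
The captions of Figures 3 and 4 refer to crossover and mutation on a fixed-length encoding, but this summary does not specify the operators. As a rough illustration only, the sketch below uses two standard choices, single-point crossover and per-gene uniform mutation, on a list of integer genes. The function names, the `gene_ranges` parameter, and the mutation rate are hypothetical and should not be read as the operators used in AMF-ENAS.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length encodings at a random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("fixed-length encodings must have equal length")
    if len(parent_a) < 2:
        raise ValueError("need at least two genes to choose a cut point")
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the encoding
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(genes, gene_ranges, rate, rng):
    """Resample each gene with probability `rate`.

    gene_ranges[i] is the number of allowed values for gene i, so a mutated
    gene always stays inside its own segment's valid range.
    """
    child = list(genes)
    for i, n_values in enumerate(gene_ranges):
        if rng.random() < rate:
            child[i] = rng.randrange(n_values)
    return child


if __name__ == "__main__":
    rng = random.Random(42)
    a = [0, 1, 2, 3, 1, 2]
    b = [5, 4, 3, 2, 0, 1]
    c1, c2 = one_point_crossover(a, b, rng)
    print(c1, c2)
    print(mutate(c1, gene_ranges=[6, 6, 6, 6, 4, 3], rate=0.2, rng=rng))
```

Because both operators preserve the encoding length, every offspring can still be decoded segment by segment, which is the main practical benefit of a fixed-length scheme.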