QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li

TL;DR

QEAN is a quaternion-enhanced attention network for visual dance synthesis. It consists of a spin position embedding (SPE) module and a quaternion rotary attention (QRA) module, which together enable the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation.

Abstract

The study of music-generated dance is a novel and challenging image generation task. Given a piece of music and seed motions as input, the goal is to generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks involving human movement and music because they struggle to capture nonlinear relationships and temporal structure. This can lead to issues such as joint deformation, role deviation, floating, and dance movements that are inconsistent with the music. In this paper, we propose a Quaternion-Enhanced Attention Network (QEAN) for visual dance synthesis from a quaternion perspective, which consists of a Spin Position Embedding (SPE) module and a Quaternion Rotary Attention (QRA) module. First, SPE embeds position information into self-attention in a rotational manner, leading to better learning of the features of movement sequences and audio sequences and an improved understanding of the connection between music and dance. Second, QRA represents and fuses 3D motion features and audio features as a series of quaternions, enabling the model to better learn the temporal coordination of music and dance under the complex temporal cycle conditions of dance generation. Finally, we conducted experiments on the AIST++ dataset, and the results show that our approach achieves better and more robust performance in generating accurate, high-quality dance movements. Our source code and the dataset are available at https://github.com/MarasyZZ/QEAN and https://google.github.io/aistplusplus_dataset, respectively.

Paper Structure (15 sections, 20 equations, 7 figures, 2 tables)

This paper contains 15 sections, 20 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The motivation of our method. We compare the effectiveness of our method with other approaches in generating dance movements from seed motions. In the top row, labeled "other methods", two sets of images show seed movements being transformed into unnatural final poses characterized by joint deformation and character drift. Conversely, in the bottom row, labeled "our method", we demonstrate how applying Pre-quaternion parameterization (P), Spin Position Embedding (S), and Quaternion Attention (Q) yields natural-looking final poses. Each prediction produced by our method successfully learns the correlations between dance and music.
  • Figure 2: The overview of our method. (a) describes the basic process, which contains three modules (i), (ii), and (iii). Given a motion sequence of 120 frames and an audio sequence of 240 frames as inputs, features are extracted by the motion transformer and the audio transformer, respectively. The extracted features are then parameterized by a quaternion parameterization operation, which changes their dimension to 4. In the Spin Position Embedding (SPE) module, the corresponding 4-dimensional features are rotated so that position information is embedded into the self-attention in a rotational manner. The output of the SPE is passed to the quaternion attention transformer, which explores the coordination between the music and the dance, and finally the corresponding dance is generated. (i), (ii), and (iii) describe the quaternion parameterization, the spin position embedding, and the basic structure of the quaternion attention transformer, respectively. Specific details are given in the Methods section.
  • Figure 3: An overview of Spin Position Embedding. The input action sequences and audio sequences in this paper are represented as feature vectors after being encoded by their respective Transformers. The feature vectors of the action sequences are $x_m$, and the feature vectors of the audio sequences are $x_n$. These feature vectors are then multiplied by different rotation matrices $R_m$, $R_n$ according to their positions m and n in their respective sequences, which fuses the positional information into them. Finally, the encoded vectors of the action sequences are transformed into query vectors $q_m$, and the rotationally transformed key vectors $k_n$ of the encoded audio sequences are combined with them through a dot product to compute the correlation between the two modal sequences. With this Spin Position Embedding, the model can better capture the positional information of the two sequences, as well as the correlation between them, thus improving the learning of cross-modal representations.
  • Figure 4: The overall structure of our Transformer. Building on the original Transformer, our structure enhances the generalisation ability of the model by adding regularisation such as Dropout in multiple places and by adjusting the number of attention heads to expand the model capacity. The absolute position information of the input sequence is converted by Spin Position Embedding into a polar coordinate representation of the relative position, ($\rho$, $\theta$), where $\rho$ denotes the distance from the centre point and $\theta$ denotes the relative angle. This Spin Position Embedding module provides better local relative positions with some rotational invariance. In this way, our model can better support tasks that are sensitive to position information, such as behavioural sequence modelling and 3D shape analysis.
  • Figure 5: A three-dimensional illustration of a rotated softmax-kernel. The rotated softmax-kernel represents the embeddings in quaternion form and rotates them using the angular frequency $\omega$, so that embeddings with different phases can be distinguished. Finally, the similarity of the rotated embeddings is measured by the exponential dot product between them.
  • ...and 2 more figures
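
The rotation described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard rotary-embedding construction, in which consecutive feature pairs are rotated by an angle proportional to the position, and it skips the learned query/key projections. The function names `rotate` and `score`, the `base` constant, and the random 4-dimensional features are all hypothetical. The point it shows is that once a motion feature at position m and an audio feature at position n are both rotated, their dot product depends only on the offset n - m.

```python
import numpy as np


def rotate(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    Assumption: block-diagonal 2x2 rotations as in standard rotary
    embeddings; the paper's exact rotation matrices R_m may differ.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    # One angular frequency per feature pair (hypothetical schedule).
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


def score(x_motion, m, x_audio, n):
    """Dot product between a rotated motion query and a rotated audio key."""
    q_m = rotate(x_motion, m)  # stands in for the query vector q_m
    k_n = rotate(x_audio, n)   # stands in for the key vector k_n
    return float(q_m @ k_n)


rng = np.random.default_rng(0)
x_motion = rng.normal(size=4)  # 4-dimensional feature, as in Figure 2
x_audio = rng.normal(size=4)

# Same offset (n - m = 7) at two different absolute positions.
s_near = score(x_motion, 3, x_audio, 10)
s_far = score(x_motion, 53, x_audio, 60)
print(s_near, s_far)
```

Under these assumptions the two printed scores should agree up to floating-point error, which is the sense in which a rotational embedding encodes relative rather than absolute position. The rotation also leaves the norm of each feature vector unchanged.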