SATO: Stable Text-to-Motion Framework

Wenshuo Chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen

TL;DR

The paper tackles instability in text-to-motion models when faced with synonym-like perturbations, identifying unstable text-encoder attention as a key cause. It introduces SATO, a plug-and-play framework with stable attention, stable prediction, and an accuracy-robustness trade-off, accompanied by a large synonym perturbation dataset derived from HumanML3D and KIT-ML. It defines formal stability criteria, develops a stable attention surrogate loss, and employs Projected Gradient Descent (PGD) and Random Synonym Replacement (RSR) perturbations plus a frozen teacher to balance robustness with original performance. Empirical results show SATO achieves state-of-the-art stability with minimal degradation in original-input quality and improved human preference under perturbations, enabling more reliable real-world text-to-motion systems.

Abstract

Is the text-to-motion model robust? Recent advancements in text-to-motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions are often inconsistent, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated respectively to stable attention, stable prediction, and balancing the trade-off between accuracy and robustness. We present a methodology for constructing a SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy.
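
The link between erratic attention and unstable outputs can be made concrete with a small illustration. The sketch below is not the paper's stable attention loss; it is a hypothetical diagnostic that measures how many of the top-k attended token positions survive a perturbation. The function names, the toy attention vectors, and the assumption that the original and perturbed captions have the same number of tokens are all illustrative.

```python
def top_k_indices(weights, k):
    """Return the set of positions holding the k largest attention weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:k])


def top_k_overlap(attn_original, attn_perturbed, k):
    """Fraction of top-k token positions shared by two attention vectors.

    Assumes both vectors cover the same number of tokens, e.g. after a
    one-to-one synonym replacement (an assumption of this sketch).
    """
    if len(attn_original) != len(attn_perturbed):
        raise ValueError("attention vectors must have the same length")
    if not 0 < k <= len(attn_original):
        raise ValueError("k must be between 1 and the number of tokens")
    shared = top_k_indices(attn_original, k) & top_k_indices(attn_perturbed, k)
    return len(shared) / k


# Toy attention weights over five tokens (made-up numbers).
attn_original = [0.05, 0.40, 0.10, 0.35, 0.10]
attn_stable = [0.06, 0.38, 0.11, 0.36, 0.09]    # same tokens stay on top
attn_unstable = [0.45, 0.05, 0.30, 0.10, 0.10]  # attention shifts elsewhere

stable_score = top_k_overlap(attn_original, attn_stable, k=2)
unstable_score = top_k_overlap(attn_original, attn_unstable, k=2)
```

An overlap near 1 means the encoder keeps attending to the same tokens after the perturbation; an overlap near 0 corresponds to the kind of attention shift the abstract associates with incorrect motions.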

Paper Structure (21 sections, 12 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparisons on $FID_D$ and $FID_P$. The closer the model is to the origin, the better. The arrow indicates the effect of our method on the model. Our SATO framework can make the text-to-motion model more stable.
  • Figure 2: Token modification example. In many examples, when the input is perturbed, the model produces an incorrect motion sequence, as shown in the bottom-left figure. When we correct the first erroneous token during the model prediction process, we obtain the correct motion sequence, as depicted in the bottom-right figure. The accuracy of the first token is crucial for the subsequent temporal predictions of the model.
  • Figure 3: (a) Framework of our proposed Stable Text-to-Motion (SATO). It comprises three components: perturbation module, stable attention module, and pretrained teacher model. (b) The perturbation module encompasses two approaches for perturbation, namely Random Synonym Replacement (RSR) and Projected Gradient Descent (PGD). This module is utilized to emulate various perturbations encountered during user interactions. (c) The stable attention module aligns the top-k attention index weights before and after perturbation to stabilize the model's attention distribution. Additionally, we incorporate a frozen teacher module, solely utilized during training, to stabilize the model's motion generation capability, thus balancing the trade-off between accuracy and robustness.
  • Figure 4: Visual results on user testing. SATO (T2M-GPT) refers to fine-tuning based on T2M-GPT to create SATO. Below each action sequence is the corresponding motion caption. The bold text represents the top-k attention weight words. It can be seen that the perturbation of the caption can lead to changes in the attention of the text, which can lead to catastrophic errors in the generative model. SATO has demonstrated superior stability to other models both in terms of attention and motion prediction.
  • Figure 5: Model stability evaluation under different perturbations. It can be observed that across all levels of perturbation, SATO (T2M-GPT) consistently outperforms T2M-GPT in terms of stability metrics. Even when subjected to significant perturbation, our model maintains excellent stability.
  • ...and 3 more figures
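
The Random Synonym Replacement (RSR) perturbation named in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synonym table, the `rate` parameter, and the function name are hypothetical, and a real pipeline would presumably draw synonyms from a lexical resource rather than a hand-written dictionary.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "person": ["human", "individual"],
    "walks": ["strolls", "paces"],
    "quickly": ["rapidly", "swiftly"],
}


def random_synonym_replacement(caption, synonyms, rate, rng):
    """Replace each word that has known synonyms with probability `rate`.

    Words without an entry in `synonyms` are left untouched, so the
    perturbed caption keeps the same number of tokens as the original.
    """
    perturbed = []
    for word in caption.split():
        candidates = synonyms.get(word.lower())
        if candidates and rng.random() < rate:
            perturbed.append(rng.choice(candidates))
        else:
            perturbed.append(word)
    return " ".join(perturbed)


rng = random.Random(0)  # seeded so the perturbation is reproducible
caption = "a person walks quickly forward"
perturbed = random_synonym_replacement(caption, SYNONYMS, rate=1.0, rng=rng)
unchanged = random_synonym_replacement(caption, SYNONYMS, rate=0.0, rng=rng)
```

With `rate=1.0` every word that has an entry in the table is swapped, while `rate=0.0` returns the caption unchanged; intermediate values give lighter perturbations of the kind compared in Figure 5.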