
Guided Attention for Interpretable Motion Captioning

Karim Radouane, Julien Lagarde, Sylvie Ranwez, Andon Tchechmedjiev

TL;DR

The paper introduces an architecture that improves caption quality while emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. It also proposes methods for guiding attention during training, so that the model emphasizes relevant skeleton areas over time and distinguishes motion-related words.

Abstract

Diverse and extensive work has recently been conducted on text-conditioned human motion generation. However, the reverse direction, motion captioning, has seen comparatively less progress. In this paper, we introduce a novel architecture design that enhances text generation quality by emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. To encourage human-like reasoning, we propose methods for guiding attention during training, emphasizing relevant skeleton areas over time and distinguishing motion-related words. We discuss and quantify our model's interpretability using relevant histograms and density distributions. Furthermore, we leverage interpretability to derive fine-grained information about human motion, including action localization, body part identification, and the distinction of motion-related words. Finally, we discuss the transferability of our approaches to other tasks. Our experiments demonstrate that attention guidance leads to interpretable captioning while enhancing performance compared to higher parameter-count, non-interpretable state-of-the-art systems. The code is available at: https://github.com/rd20karim/M2T-Interpretable.

Paper Structure (26 sections, 12 equations, 38 figures, 6 tables)


Figures (38)

  • Figure 1: The encoder branch encodes frame-wise, part-based motion representations from joint positions ($X_{ik}$) and velocities ($V_{ik}$). The decoder branch takes as input the previous token $\hat{y}_{t-1}$ and the previous state ($h_{t-1}$, $m_{t-1}$), and estimates the relative importance (gate $\hat{\beta}_t$) of motion information to consider when predicting the word $\hat{y}_t$. Spatial attention ($\hat{\alpha}_{tik}$) and temporal attention ($\Gamma_{tk}$) are computed from the encoded part embeddings $P_{ik}$ and $h_{t}$. The spatio-temporal weights are used to compute the context vector $c_t$, which is then passed to the decoder's adaptive gate. The main loss is $Loss_{lang}$, the cross-entropy between predicted and target words. We propose to guide spatial and adaptive attention with $Loss_{spat}$ and $Loss_{adapt}$.
  • Figure 2: $\hat{\beta}$ density distribution over the test set for a few motion word stems on HumanML3D, and the temporal maximum body-part attention histogram for the word "turn".
  • Figure 3: $\hat{\beta}$ density distribution over the test set for some non-motion words (stemmed) on HumanML3D.
  • Figure 4: Effect of spatial supervision on HumanML3D across the entire test set for a given motion word (e.g. "throw"); # refers to the number of the given motion words.
  • Figure 5: Temporal Gaussian window displayed for different motion words given a prediction on KIT-ML.
  • ...and 33 more figures
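
The data flow described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product scoring, the softmax over frames (the paper's figures mention a temporal Gaussian window instead), the sigmoid gate, the convex blend of $c_t$ with $h_t$, and the names `attend_and_gate`, `W_s`, and `w_b` are all assumptions made for illustration. Only the symbols $P_{ik}$, $h_t$, $\hat{\alpha}_{tik}$, $\Gamma_{tk}$, $c_t$, and $\hat{\beta}_t$ come from the caption.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_and_gate(P, h_t, W_s, w_b):
    """One decoding step of a hypothetical spatio-temporal attention with a gate.

    P   : (parts, frames, d) encoded part embeddings P_ik
    h_t : (d,) decoder hidden state
    W_s : (d, d) spatial scoring matrix (assumed parameter)
    w_b : (2d,) gate weights (assumed parameter)
    """
    # Spatial attention alpha_tik: weights over body parts i for each frame k.
    alpha = softmax(P @ (W_s @ h_t), axis=0)            # (parts, frames)
    # Attention-weighted summary of the skeleton at each frame.
    frame_emb = (alpha[..., None] * P).sum(axis=0)      # (frames, d)
    # Temporal attention Gamma_tk: weights over frames k (softmax stand-in).
    gamma = softmax(frame_emb @ h_t, axis=0)            # (frames,)
    # Context vector c_t from the spatio-temporal weights.
    c_t = gamma @ frame_emb                             # (d,)
    # Adaptive gate beta_t in (0, 1): how much motion information to use.
    beta = 1.0 / (1.0 + np.exp(-(w_b @ np.concatenate([h_t, c_t]))))
    # Assumed blend: motion context versus language state.
    mixed = beta * c_t + (1.0 - beta) * h_t
    return alpha, gamma, c_t, beta, mixed
```

Under these assumptions, a gate value near 1 would mean the predicted word relies mostly on motion context, and a value near 0 would mean it relies mostly on the language state, which is the kind of behavior the $\hat{\beta}$ density plots in Figures 2 and 3 examine for motion and non-motion words.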