Aligning Actions and Walking to LLM-Generated Textual Descriptions

Radu Chivereanu; Adrian Cosma; Andy Catruna; Razvan Rughinis; Emilian Radoi

TL;DR

This work addresses the challenge of aligning motion sequences with textual descriptions by leveraging Large Language Models to generate rich, descriptive captions for both actions and gait appearance. It introduces a CLIP-like framework with a GaitFormer-based motion encoder and a frozen text encoder (UAE-Large-V1), trained using multiple losses including $L_{MSE}$ and $L_{Triplet}$ to align pose embeddings with language embeddings. The authors augment supervision via LLM-generated action descriptions and use DenseGait appearance attributes to produce appearance-driven motion descriptions, enabling retrieval of walking sequences from text. Experiments on BABEL-60 show competitive action recognition performance, while DenseGait-based retrieval demonstrates that appearance-based textual descriptions can guide multi-modal gait retrieval, highlighting a new avenue for appearance-to-motion understanding and data augmentation.
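
As a rough illustration of the alignment objective named above, the sketch below combines an MSE term and a triplet term between motion embeddings and fixed text embeddings, then ranks motion sequences against a text query. It is a minimal NumPy sketch and not the authors' implementation: the function names, the loss weights, the margin, the Euclidean distance, the choice of in-batch negatives (the batch shifted by one), and the cosine-similarity retrieval step are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mse_loss(motion, text):
    """Mean squared error between paired motion and text embeddings."""
    return float(np.mean((motion - text) ** 2))


def triplet_loss(motion, text, margin=0.2):
    """Triplet loss with the motion embedding as anchor.

    Positive: the paired text embedding.
    Negative (assumption): the text embedding of the previous batch item.
    """
    negatives = np.roll(text, shift=1, axis=0)
    d_pos = np.linalg.norm(motion - text, axis=1)
    d_neg = np.linalg.norm(motion - negatives, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def alignment_loss(motion, text, margin=0.2, w_mse=1.0, w_triplet=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_mse * mse_loss(motion, text) + w_triplet * triplet_loss(motion, text, margin)


def retrieve(text_query, motion_bank, k=3):
    """Indices of the k motion embeddings most similar to a text query (cosine)."""
    sims = l2_normalize(motion_bank) @ l2_normalize(text_query)
    return np.argsort(-sims)[:k]


# Toy data: four text embeddings treated as fixed targets (the text encoder
# is frozen in the paper, so only the motion side would receive gradients).
text_emb = np.eye(4)
aligned_motion = text_emb.copy()                 # perfectly aligned motion embeddings
shuffled_motion = np.roll(text_emb, 1, axis=0)   # each motion matched to the wrong text

loss_aligned = alignment_loss(aligned_motion, text_emb)
loss_shuffled = alignment_loss(shuffled_motion, text_emb)
top_match = retrieve(text_emb[2], aligned_motion, k=1)
```

In this toy setup the aligned batch yields a lower loss than the shuffled one, and the text query for item 2 retrieves motion item 2. In an actual training loop the same terms would be computed in an autodiff framework over the motion encoder's outputs.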

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at https://github.com/Radu1999/WalkAndText

Paper Structure (15 sections, 5 equations, 7 figures, 6 tables)

INTRODUCTION
RELATED WORK
METHOD
Aligning Motion with Text
Datasets
Label Augmentation through Caption Generation
Generating Appearance Descriptions for Walking Sequences
Model Architectures
Training Objectives
Implementation Details
EXPERIMENTS
Generating descriptions for motion
Comparison with Action Recognition
Retrieving Walking Sequences based on Appearance Description
CONCLUSIONS

Figures (7)

Figure 1: Using the expressivity of Large Language Models, we generate rich textual descriptions for motion sequences across actions and walking sequences. Descriptions are further used to align embeddings between text and motion.
Figure 2: Overall diagram of our method. We use a pretrained large language model to generate a rich textual description of a motion sequence. This description is used to align motion representations with their natural language descriptions.
Figure 3: Automatic attribute annotation for walking sequences in the DenseGait dataset. Each walking sequence is augmented 5 times, and appearance attributes are estimated using an ensemble of three pretrained models. Figure adapted from Cosma and Radoi [cosma22gaitformer].
Figure 4: Distribution per appearance feature in DenseGait. Figure adapted from Cosma and Radoi [cosma22gaitformer].
Figure 5: Multi-label classification results for directly predicting the appearance attributes given a walking sequence.
...and 2 more figures
