MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Luca Collorone, Matteo Gioia, Massimiliano Pappa, Paolo Leoni, Giovanni Ficarra, Or Litany, Indro Spinelli, Fabio Galasso

TL;DR

MonSTeR addresses the need to evaluate coherence among intention (text), motion, and environment (scene) by learning a unified latent space that captures higher-order interactions among all three modalities. It introduces a trimodal retrieval framework with unimodal variational encoders and cross-modal encoders, optimized via a six-term contrastive objective that aligns unimodal and cross-modal representations. The paper reports strong retrieval performance across multiple tasks, validates the alignment of retrieval scores with human judgments through a user study, and shows zero-shot capabilities for in-scene object placement and motion captioning, as well as potential as an evaluation tool for Human Scene Interaction models. The resulting latent space supports both retrieval and generation, and code and models are publicly available.

Abstract

Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: MonSTeR can estimate coherence among text, motion, and scene by embedding them into a unified latent space. In the left image, all three modalities are coherent. However, in the right image, this coherence decreases, as there is no chair available in the scene.
  • Figure 2: MonSTeR's Architecture Overview. Each input modality $t,s,m$ is processed by its single-modality encoder. From the first output tokens of $T,M,S$ we sample vectors $v_{t}$, $v_{m}$, and $v_{s}$. The remaining tokens of each encoder's output, namely $\varepsilon_t$, $\varepsilon_m$, $\varepsilon_s$, are pairwise concatenated and passed through cross-modal encoders to generate joint latent vectors ($v_{st}$, $v_{mt}$, $v_{ms}$).
  • Figure 3: Composition of $C_{st,m}$ and $C_{t,m}$ similarity matrices. For $C_{st,m}$, motion latents $v_{m}^{n}$ are compared with the cross-modal scene-text latents $v_{st}^{n}$, while for $C_{t,m}$, they are compared only with text latents $v_{t}^{n}$. Green cells are the locations of similarity scores between positive samples. Our optimization objective promotes assigning higher scores to these cells in all matrices $C_{i,j}$.
  • Figure 4: Qualitative examples for st2m and ms2t. The first, second, and third retrieved samples are shown. In each panel, GT=color (top-left corner) indicates the correct corresponding motion (top row) and text (bottom row).
  • Figure 5: MonSTeR's FID $\downarrow$ (a) and Recall@1 $\uparrow$ (b) trends when motions are increasingly rotated, from $0$ to $\pi$ radians.
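
The similarity matrices described in the Figure 3 caption can be illustrated with a small sketch. It builds $C_{st,m}$ and $C_{t,m}$ from toy latents, scores each with a contrastive loss that rewards high values on the diagonal (the positive pairs), and computes Recall@1. Everything specific here is an assumption made for illustration and is not taken from the paper: the cosine similarity, the symmetric InfoNCE form of the loss, the temperature value, the toy vectors, and the function names `similarity_matrix`, `info_nce`, and `recall_at_1`.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(a_latents, b_latents):
    """C[i][j] = cosine similarity between the i-th 'a' latent and the j-th 'b' latent."""
    a = [normalize(v) for v in a_latents]
    b = [normalize(v) for v in b_latents]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def info_nce(C, temperature=0.1):
    """Symmetric InfoNCE over a square similarity matrix (assumed loss form).

    Positive pairs sit on the diagonal; the loss falls as diagonal scores
    grow relative to the other entries of the same row and column.
    """
    n = len(C)

    def one_direction(M):
        total = 0.0
        for i in range(n):
            logits = [x / temperature for x in M[i]]
            mx = max(logits)  # subtract the max for a stable log-sum-exp
            log_denominator = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += log_denominator - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*C)]
    return 0.5 * (one_direction(C) + one_direction(transposed))


def recall_at_1(C):
    """Fraction of rows whose highest-scoring column is the matching (diagonal) one."""
    hits = 0
    for i, row in enumerate(C):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(C)


# Toy latents for a batch of three samples (hypothetical values).
v_m = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # motion
v_st = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]  # cross-modal scene-text
v_t = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.0, 0.3, 0.7]]   # text only

C_st_m = similarity_matrix(v_st, v_m)
C_t_m = similarity_matrix(v_t, v_m)

loss_st_m = info_nce(C_st_m)
loss_t_m = info_nce(C_t_m)
print(f"loss(C_st,m) = {loss_st_m:.4f}, Recall@1 = {recall_at_1(C_st_m):.2f}")
print(f"loss(C_t,m)  = {loss_t_m:.4f}, Recall@1 = {recall_at_1(C_t_m):.2f}")
```

In this toy batch the scene-text latents were chosen to sit closer to their matching motions than the text-only latents, so the loss on $C_{st,m}$ comes out lower. A multi-term objective of the kind the TL;DR mentions would sum losses like these over several such matrices; which pairs of latents enter that sum is not specified in this section, so the sketch stops at two matrices.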