MonSTeR: a Unified Model for Motion, Scene, Text Retrieval
Luca Collorone, Matteo Gioia, Massimiliano Pappa, Paolo Leoni, Giovanni Ficarra, Or Litany, Indro Spinelli, Fabio Galasso
TL;DR
MonSTeR evaluates the coherence among intention (text), motion, and environment (scene) by learning a unified latent space that captures higher-order interactions among all three modalities. It introduces a tri-modal retrieval framework with unimodal variational encoders and cross-modal encoders, trained with a six-term contrastive objective that aligns unimodal and cross-modal representations. The paper reports strong retrieval performance across multiple tasks, validates that its retrieval scores align with human judgments, and demonstrates zero-shot in-scene object placement and motion captioning, as well as potential use as an evaluation tool for Human Scene Interaction models. The resulting latent space serves both retrieval and generation, and code and models are publicly available.
Abstract
Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.
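To make the idea of aligning unimodal and cross-modal representations in one latent space more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a symmetric InfoNCE loss summed over six embedding pairs, and the particular pairing (three unimodal pairs, plus each cross-modal embedding against the remaining modality) is a guess made for illustration. The names `info_nce`, `six_term_loss`, `PAIRS`, and `rank`, the dictionary keys, and the temperature value are all hypothetical and do not come from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(xs, ys, temperature=0.1):
    """Symmetric InfoNCE over a batch where xs[i] and ys[i] are a true pair."""
    n = len(xs)
    logits = [[cosine(x, y) / temperature for y in ys] for x in xs]

    def mean_cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


# ASSUMPTION: one plausible set of six pairs, not taken from the paper.
PAIRS = [
    ("motion", "text"),
    ("motion", "scene"),
    ("scene", "text"),
    ("motion+scene", "text"),
    ("scene+text", "motion"),
    ("motion+text", "scene"),
]


def six_term_loss(embeddings, temperature=0.1):
    """Sum the symmetric InfoNCE loss over the six hypothetical pairs."""
    return sum(
        info_nce(embeddings[a], embeddings[b], temperature) for a, b in PAIRS
    )


def rank(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))
```

Under these assumptions, a batch whose six embedding sets agree sample by sample yields a lower loss than one in which the text embeddings are shuffled, and once training has aligned the space, retrieval reduces to the nearest-neighbour ranking in `rank`.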
